Introduction

The purpose of this paper is to review discriminant analysis in
terms of (1) formulations, (2) interpretations, (3) uses, (4) issues and
problems in applications, (5) recent developments and conceptualizations,
and (6) general references and computer programs.

A complete review of literature related to the development of various
aspects of discriminant analysis will not be attempted in this paper. Excellent
reviews have already been written which take the interested reader
from the pre-Fisher conceptualization of the two-group classification problem,
through K-group formulations, to the use of discriminant analysis as a more
general multivariate data analysis technique. The very comprehensive review
by Hodges (1950), which focuses on the use of discriminant analysis for
classification purposes, covers the historical development through the Pearsonian
stage, dealing with measures of resemblance; the Fisherian stage, dealing with
the linear discriminant function; the Neyman-Pearson stage, dealing with
probabilities of misclassification; and the Waldian stage, dealing with
risk and minimax ideas in classification. The review by Tatsuoka and
Tiedeman (1954) covers developments in the area of classification as well
as the relationship of discriminant analysis to other aspects of multivariate
data analysis. In particular, they review C. R. Rao's conceptualization
of the problem, extensions of R. A. Fisher's linear discriminant function,
and the integration of the two. A brief review of early work in classification
is provided by Ottman, Ferguson, and Kaufman (1956), along with an
application of Rao's classification equations. A more recent review of
classification theory and methodology is given by Das Gupta (1973).
The review includes sections on the early history of classification
[basically a summary of Hodges' review], general classification problems and
theory (including empirical Bayes approaches), multivariate normal classification,
and non-normal distributions and nonparametric methods. An extensive
and fairly recent bibliography is provided; references are listed for each
section.

The formal relationship of the mathematics underlying the linear
discriminant function ("function" here is not used in a mathematical sense)
to other techniques in the domain of multivariate analysis was noted some
years ago by, e.g., Bartlett (1947) and Tintner (1950), and more recently
by Cooley and Lohnes (1971), Tatsuoka (1971), Van de Geer (1971), and Mulaik
(1972). Despite the relationship, applications of "discriminant analysis"
have in the past been somewhat divorced from other multivariate techniques,
with classification being the primary concern. However, the use of discriminant
analysis as an aid in characterizing group differences is seen as a
very important extension beyond its use as a mere classificatory tool. In his
brief review, Tatsuoka (1969) states that the extension of discriminant
analysis "...as a follow-up to MANOVA is probably one of the most significant
developments in multivariate analysis during the past ten years [p. 742]."
Specific uses of discriminant analysis in relation to multivariate analysis
of variance (MANOVA), and methods of interpreting (linear) discriminant
functions are cogently reviewed by Tatsuoka (1973a).

In a somewhat restrictive view, discriminant analysis had been considered
in light of a mathematical problem. In this sense the idea was to simplify
a multivariate situation to a univariate one. That is, given K well-defined
groups and p measures on each individual in each group, the objective
was to determine a (linear) composite of the p measures which would maximize
the between-group variance of the composite relative to the within-group
variance. Once mathematical formulations of the basic problem were, in
some ways, satisfactorily performed, applied statisticians and data analysts
began to utilize them in various ways.

Aspects of Discriminant Analysis 

In different areas of application the term "discriminant analysis" has
come to imply distinct meanings, uses, roles, etc. In the fields of learning,
psychology, guidance, and others, it has been used for prediction (e.g.,
Alexakos, 1966; Chastian, 1969; Stahmann, 1969); in the study of classroom
instruction it has been used as a variable reduction technique (e.g., Anderson
et al., 1969); and in various fields it has been used as an adjunct to
MANOVA (e.g., Saupe, 1965; Spain and D'Costa, 1970). The term is now beginning
to be interpreted as a unified approach in the solution of a research
problem involving a comparison of two or more populations characterized by
multi-response data.

Discriminant analysis as a general research technique can be very useful
in the investigation of various aspects of a multivariate research problem.
In the early 1950's Tatsuoka and Tiedeman (1954) emphasized the multi-phasic
character of discriminant analysis: "(a) the establishment of significant
group-differences, (b) the study and 'explanation' of these differences,
and finally (c) the utilization of multivariate information from the samples
studied in classifying a future individual known to belong to one of the
groups represented [p. 414]." Essentially these same three problems related
to discriminatory analysis were mentioned some years later by Nunnally (1967,
p. 388).

For clarity of communication in this paper, four aspects of
a "discriminant analysis" will be considered. They are (1) separation:
determining significant inter-group differences in terms of group centroids
(i.e., mean vectors); (2) discrimination: studying group separation with
respect to dimensions and to (discriminator) variable contribution to
separation; (3) estimation: obtaining estimates of inter-population distances
(between centroids) and of degree of relationship between the response
variables and group membership; and (4) classification: setting up rules
for assigning an individual to one of the pre-determined exhaustive populations.
It should be noted that this terminology differs from that used by
other writers. Of course, separation is usually thought of in terms of
significance testing via MANOVA; in fact one-way MANOVA and "discriminant
analysis" are sometimes considered synonymous (McCall, 1970, p. 1373).
Discrimination as used here actually refers to methods of interpreting linear
discriminant functions and their coefficients. This term has been used by
others as the equivalent of what in this paper is called classification
(Kendall, 1966, 1973; Kshirsagar, 1972). Rather than "classification,"
Rao (1965) uses "identification," while Kendall (1966, 1973) and Harman
(1971) use "classification" for what is often referred to by behavioral
scientists as "cluster analysis." The inclusion of estimation as an additional
aspect was done for the purpose of emphasizing supplementary means
of interpreting the results of a discriminant analysis.

Separation 

The basics of MANOVA as a confirmatory (in the sense of significance
testing) data analysis technique have been quite thoroughly covered in
various books and technical papers and will not be discussed here.
The formal equivalence, mathematically speaking, of MANOVA and some
aspects of discriminant analysis was alluded to in the last section.

When the purpose of "research" is that of drawing conclusions and investigating
scientific problems of group comparisons, it has been suggested
that "discriminant analysis" not be identified as a tool of educational
research. Rather, it has been claimed that discriminant analysis applies
to practical problems of optimal classification of individuals into groups
[see Bock (1966, p. 822)]. In some investigatory situations, nevertheless,
it may seem reasonable to use one-way MANOVA as a preliminary step to,
or a first phase of, a discriminant analysis. The classical argument
is that unless the investigator is assured of group differences to begin
with, it is senseless to seek the linear composite to be used for (discrimination
or) classification purposes. However, even though a value of any
one of many possible MANOVA statistics might tend to support the null
hypothesis of mean homogeneity, it is possible that for one reason or
another the data support the alternative hypothesis.

If the mean differences among the criterion groups are all zero, no
differentiation is of course possible in the normal case with equal dispersions,
but it might be worth examining this special situation in the
case of unequal dispersions. Bartlett and Please (1963) and Desu and
Geisser (1973) cover ways of looking at this problem when there are only
two groups.

In considering the role of MANOVA in a "discriminant analysis" the
most important factor is the purpose of the analysis being performed and
the questions one has of the data. The design of the study, including
sampling, data collection, and questions ("contrasts" if you like),
specifies the data analysis technique(s). If the investigation entails
some type of sampling of individuals with the notion of drawing conclusions,
in an inferential sense, about levels of performance or about locations of
distributions, then MANOVA may be quite appropriate; this analysis may
be followed up by what we have called discrimination and estimation
methods. In this context, the variables whose means are being compared are
the dependent variables while the independent variable(s) is (are) the grouping
variable(s). On the other hand, the study may be one of prediction (of
group membership), where the predictors are the independent variables and
the dependent variable is a grouping variable. In this latter situation,
there is no manipulation of the grouping variable, with the groups being
formed a priori. Here, MANOVA may not be called for; the investigator
proceeds directly to obtaining classification statistics.

Discrimination

A great deal of research in the behavioral sciences deals with the
comparisons of different groups of individuals in terms of one or more
measures. What characterizes a "group" depends, at least in part, on
whether or not the grouping variable is manipulable, as in experimental
versus ex post facto studies. The MANOVA technique is often used in both
of these situations when the data collection design is assumed to be
appropriate. The omnibus null hypothesis tested in a one-way MANOVA
design is that of the equality of the population centroids. When the populations
are significantly separated, subsequent and more detailed study of
the group differences would definitely be called for. One follow-up technique
is that which we label as "discrimination"; others, such as multivariate
multiple comparisons, are discussed elsewhere (Stevens, 1973; Tatsuoka,
1973a). Stevens (1972b) reviews four methods of analyzing between-group
variation, one of which is based on linear discriminant functions.
Linear Discriminant Functions
The procedures used in discrimination center around linear discriminant
functions (LDFs). The mathematics behind LDFs is presented in various books
and papers (see, e.g., Tatsuoka, 1971; Porebski, 1966a). One resulting
formulation may be briefly described as follows. A linear composite of
measures on p random variables for individuals in K criterion groups,

[1]    $Y_1 = v_{11}X_1 + v_{12}X_2 + \cdots + v_{1p}X_p$,

is determined so that $MSH_Y/MSE_Y$ is maximized or, equivalently, $SSH_Y/SSE_Y$ is
maximized. Here $MSH_Y$ and $MSE_Y$ denote the hypothesis and error mean squares
with respect to Y-scores, respectively. (The "hypothesis" in a one-way
MANOVA design refers to the between-group source of variation.) To obtain
the v-values in [1], the largest non-zero characteristic root (or eigenvalue),
$\lambda_1$, of $E^{-1}H$ is computed; i.e., the largest value of $\lambda$ is obtained
from the determinantal equation,

[2]    $|E^{-1}H - \lambda I| = 0$.

The (pxp) matrices, E and H, are the error or within-groups and hypothesis
or between-groups sums of squares and cross-products (SSCP) matrices,
respectively. Then the (px1) eigenvector, $v_1$, associated with $\lambda_1$ is found
by solving the set of p equations,

[3]    $(E^{-1}H - \lambda_1 I)\,v_1 = 0$.

The elements of $v_1$ are (within a constant of proportionality) the coefficients
of the linear composite in [1]. As is well known, there may be more than
one LDF. The succeeding roots, $\lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_s$ [$s = \min(K-1, p)$],
yield discriminant functions that are mutually uncorrelated (in the total
sample). The successive functions are determined so as to maximize relative
separation after preceding functions are "partialled out."
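The formulation in [1] through [3] can be illustrated numerically. The following is a minimal sketch, not a production routine: it assumes NumPy is available, uses simulated (hypothetical) data, and the function names `sscp_matrices` and `discriminant_functions` are introduced here only for illustration. It forms E and H, extracts the roots and vectors of $E^{-1}H$, and evaluates the ratio $SSH_Y/SSE_Y$ for the leading composite.

```python
import numpy as np

def sscp_matrices(X, groups):
    """Return (E, H): within-groups and between-groups SSCP matrices."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    E = np.zeros((p, p))
    H = np.zeros((p, p))
    for g in np.unique(groups):
        Xg = X[groups == g]
        centroid = Xg.mean(axis=0)
        dev = Xg - centroid                    # deviations from the group centroid
        E += dev.T @ dev
        d = (centroid - grand_mean)[:, None]   # centroid minus grand mean, as a column
        H += len(Xg) * (d @ d.T)
    return E, H

def discriminant_functions(X, groups):
    """Roots and vectors of E^{-1}H, largest root first; s = min(K-1, p) are kept."""
    X = np.asarray(X, dtype=float)
    E, H = sscp_matrices(X, groups)
    roots, vectors = np.linalg.eig(np.linalg.solve(E, H))  # solve(E, H) equals E^{-1}H
    roots, vectors = roots.real, vectors.real              # the roots are real in theory
    order = np.argsort(roots)[::-1]
    s = min(len(np.unique(groups)) - 1, X.shape[1])
    return roots[order][:s], vectors[:, order][:, :s]

# Hypothetical data: K = 3 groups of 30 individuals each, p = 4 measures.
rng = np.random.default_rng(0)
groups = np.repeat([0, 1, 2], 30)
shifts = np.array([[0.0, 0.0, 0.0, 0.0],
                   [1.0, 0.5, 0.0, 0.0],
                   [0.0, 1.0, 1.5, 0.0]])
X = rng.normal(size=(90, 4)) + shifts[groups]

roots, V = discriminant_functions(X, groups)
E, H = sscp_matrices(X, groups)
v1 = V[:, 0]
ratio_v1 = (v1 @ H @ v1) / (v1 @ E @ v1)   # SSH_Y / SSE_Y for the leading composite
print("roots:", roots)
print("ratio for leading LDF:", ratio_v1)
```

In this sketch the ratio for the leading composite should agree with the largest root, and scores on the two retained functions should be uncorrelated in the total sample, which is the property stated above.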

In Van de Geer's (1971) integration of various multivariate techniques,
the term "canonical discriminant factor analysis" is used to describe the
process of extracting the LDFs. Harris (1954) and Pruzek (1971) use the
term "dispersion analysis."

As we will see later, a means of interpreting LDFs is based on the
number of functions to be considered. Here, as in interpretation of results
in other domains of multivariate data analysis, parsimony is an objective.
Data represented in a geometry of two-space, say, are more manageable and
easier to interpret than if represented in spaces of higher dimensions.
Thus, it behooves the researcher to discard discriminant functions which
are judged not to contribute to group separation. This judgment can be
subjective, in terms of the proportion of the total discriminatory power
of p measures contained in a set of functions, or it can be based on
statistical significance tests. The former judgment is based on ratios
of individual eigenvalues to the sum of the eigenvalues. In the
literature the process of testing the significance of a function has been
lacking in clarity. First of all, Kendall (1968) has pointed out that
such tests are "...not so much tests of the functions as tests of homogeneity
(of population centroids) by the use of the functions. If heterogeneity is
found, the function, ipso facto, is significant in the sense that it
discriminates between real differences in an optimal way (except that
we use estimators of dispersions and means instead of the unknown parent
values) [p. 159]." Secondly, what hypothesis is of interest has not
been clearly stated in some writings. The issue of what hypothesis is
being tested pertains to testing the significance of individual functions
(or eigenvalues), or testing the significance of a set of functions after
partialling out the complementary set of functions that has earlier been
judged to be significant. [The mechanics of both tests are given by
Tatsuoka (1971, pp. 164-165). Two sources which leave the reader wondering
which hypothesis is being tested are Eisenbeis and Avery (1972, pp. 63,
92-93) and Rulon et al. (1967, p. 308). The test statistics reported in
these latter two references are slightly in error; N rather than N-1 is
used in the test statistics, for one error.] The chi-square statistics
used for these two hypotheses are different, but in a practical sense the
conclusions are usually the same. That is, if it is concluded that the
mth eigenvalue (1 < m < s) is the smallest one which is significant, then
we usually will conclude that the last s-m eigenvalues (or functions)
as a set with the first m removed do not yield significance (See Harris, 1974
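The two bases for judgment described above can be sketched as follows. The eigenvalues, N, p, and K are hypothetical. The chi-square approximation shown, $V = [N - 1 - (p + K)/2]\sum_{j>m}\ln(1 + \lambda_j)$ with $(p - m)(K - m - 1)$ degrees of freedom, is the form usually attributed to Bartlett; it is assumed here because the paper does not print the statistic, although its remark about N versus N-1 is consistent with this form. The function names are illustrative only.

```python
import math

def proportion_of_power(eigenvalues):
    """Ratio of each eigenvalue to the sum of the eigenvalues."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]

def residual_chi_square(eigenvalues, N, p, K, m):
    """Approximate chi-square (assumed Bartlett form) for the s - m roots
    remaining after the first m functions are removed; returns (V, df)."""
    multiplier = N - 1 - (p + K) / 2
    V = multiplier * sum(math.log(1 + lam) for lam in eigenvalues[m:])
    df = (p - m) * (K - m - 1)
    return V, df

# Hypothetical results: s = 3 roots from N = 100 individuals, p = 4, K = 4.
eigenvalues = [1.2, 0.3, 0.02]
proportions = proportion_of_power(eigenvalues)
tests = [residual_chi_square(eigenvalues, N=100, p=4, K=4, m=m) for m in range(3)]

for lam, prop in zip(eigenvalues, proportions):
    print(f"root {lam:5.2f}  proportion {prop:.3f}")
for m, (V, df) in enumerate(tests):
    print(f"after removing {m} function(s): V = {V:.2f}, df = {df}")
```

Each V would be referred to a chi-square distribution with the stated degrees of freedom; the tail probability is left out so that the sketch needs only the standard library.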

Requisite Data Conditions 
The validity of the generally used MANOVA tests of equal population
mean vectors depends upon the conditions of multivariate normality and
equal covariance structure being met (Bock and Haggard, 1968, pp. 110-113).
The referent distributions used for the various test statistics
yield probability statements which may be somewhat distorted when
either or both of the two conditions are not met. The degree and direction
of distortion are not known. The multivariate analogue of the
Behrens-Fisher problem (normality with unequal dispersions) is discussed
by Anderson (1958, pp. 118-122). Ito (1969) has proposed alternative
MANOVA tests to be used when either or both of the mentioned conditions
are violated; these tests show considerable promise for large samples.
See also James (1954).

Tests for assessing the fit of data to both multivariate normality
and equal covariance are available. The tenability of the normality
condition is typically assessed via goodness-of-fit tests, which call
for large samples. Lockhart (1967) has proposed a partial testing procedure
for the small sample case; more recent empirical studies have
been made by Aitkin (1972) and Malkovich and Afifi (1973). A test of
the equality of the group covariance matrices proposed by G. E. P. Box is
presented by Cooley and Lohnes (1971, p. 229). The practical application
of this latter test, and related concerns, are discussed by Porebski
(1966b).
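As an illustration of the covariance-equality check mentioned above, the sketch below computes the statistic usually written $M = (N - K)\ln|C| - \sum_k (n_k - 1)\ln|C_k|$, where $C_k$ is the covariance matrix of group k and C is the pooled within-groups covariance matrix. This textbook form is an assumption on our part; the paper cites the test without stating the formula. The data are simulated, the function name `box_m` is hypothetical, and the small-sample correction factor applied before referring M to chi-square is omitted.

```python
import numpy as np

def box_m(X, groups):
    """Return (M, df) for comparing group covariance matrices (assumes p >= 2)."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    N, p = X.shape
    K = len(labels)
    pooled = np.zeros((p, p))
    weighted_log_dets = 0.0
    for g in labels:
        Xg = X[groups == g]
        n = len(Xg)
        C_g = np.cov(Xg, rowvar=False)                 # divisor n - 1
        pooled += (n - 1) * C_g
        weighted_log_dets += (n - 1) * np.linalg.slogdet(C_g)[1]
    pooled /= (N - K)
    M = (N - K) * np.linalg.slogdet(pooled)[1] - weighted_log_dets
    df = p * (p + 1) * (K - 1) // 2                    # for the chi-square approximation
    return M, df

# Hypothetical data: two groups of 25 individuals, p = 3 measures.
rng = np.random.default_rng(2)
base = rng.normal(size=(25, 3))
groups = np.repeat([0, 1], 25)
same_dispersion = np.vstack([base, base + 2.0])    # shifted centroid, equal dispersion
unequal_dispersion = np.vstack([base, 3.0 * base]) # second group three times as spread

M_same, df = box_m(same_dispersion, groups)
M_unequal, _ = box_m(unequal_dispersion, groups)
print(M_same, M_unequal, df)
```

M is near zero when the group dispersions agree and grows as they diverge, independently of any difference in centroids.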

Interpretation of LDFs 
We proceed, then, with our discussion of discrimination under the
assumption that the two requisite conditions are, at least, tenable.
Having established the dimensionality of the reduced space, it is of interest
to give some interpretation of the, say, r "significant" LDFs. One very
useful means of interpretation is provided by graphic methods. Even though
the LDFs are mutually uncorrelated, they are not geometrically mutually
orthogonal in the space of the predictor variables (Tatsuoka, 1971, pp.
163, 169). [In fact, the angle of separation between vectors representing
two LDFs is an angle whose cosine is the inner product of the two corresponding
(normalized) eigenvectors.] However, it is customary and convenient
to graphically represent the K group centroids on the r LDFs by means of
a rectangular coordinate system. The experience of this writer has shown
that r is very seldom greater than two. That is, two LDFs generally
account for a great portion of the discriminatory power of the discriminators
and, hence, a two-dimensional representation gives a fairly accurate
picture of the configuration of the groups in the p-dimensional space. Of
course, if r = 1, one can merely examine the numerical values of the K
(px1) mean vectors, Y, to determine which groups or clusters of groups are
separated from which other groups or clusters. If r = 2, a two-dimensional
plot is helpful in interpreting the dimensions along which the K groups
were found to differ. For example, consider the plot in Figure 1.



— Insert Figure 1 About Here — 

From the graph it is clear that the first LDF discriminates Groups 2 and
4 from Groups 1, 3, and 5; whereas the second LDF discriminates Groups
1, 2, and 5 from Groups 3 and 4. If r > 2, pairwise two-dimensional plots
may be used.

In making an interpretation of the resulting r LDFs, a substantive
meaning of each function (or "canonical factor" or "canonical variate")
is sometimes attempted. Two approaches have been employed. The first,
in the sense of tradition, is based on magnitudes of function coefficients
that are applicable to standardized scores. These "standardized weights"
are found by multiplying each raw score coefficient by the within-groups
standard deviation of the corresponding variable:

[4]    $v^*_{mj} = v_{mj}\sqrt{e_{jj}}$,

where $e_{jj}$ is the jth diagonal element of E, m = 1,...,r, and j = 1,...,p.
These weights have been considered by some writers (Tatsuoka, 1971, p. 170;
McQuarrie and Grotelueschen, 1971) as if they were factor loadings.
Such a use of standardized weights as a means of interpreting LDFs has
been criticized by Mulaik (1972, pp. 403, 422, 427) and Tatsuoka (1973a,
p. 280). This use of standardized weights may be questioned on theoretical
grounds: these weights are actually partial coefficients and, hence, do
not pertain to the common parts among the discriminators; two discriminators
having large positive weights would not necessarily have anything in common
which contributed to group separation.
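The scaling that produces standardized weights can be shown with a small numerical sketch. The raw coefficients, the matrix E, and the sample sizes are hypothetical, and the function name is ours. The within-groups standard deviation is taken as $\sqrt{e_{jj}/(N-K)}$; because LDF coefficients are determined only within a constant of proportionality, dropping the divisor N-K leaves the relative sizes of the weights unchanged.

```python
import numpy as np

def standardized_weights(V, E, N, K):
    """Scale raw-score LDF coefficients (p x s) by within-groups standard deviations."""
    V = np.asarray(V, dtype=float)
    E = np.asarray(E, dtype=float)
    within_sd = np.sqrt(np.diag(E) / (N - K))   # one standard deviation per variable
    return V * within_sd[:, None]               # row j of V is multiplied by sd of X_j

# Hypothetical case: p = 2 variables, one LDF, N = 43 individuals in K = 3 groups.
E = np.array([[40.0, 5.0],
              [5.0, 10.0]])
V_raw = np.array([[0.5],
                  [1.0]])
V_std = standardized_weights(V_raw, E, N=43, K=3)
print(V_std)
```

Here the second raw coefficient is twice the first only because the second variable has half the within-groups spread; on the standardized scale the two weights are equal.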

The second approach that has been proposed for making a substantive
or psychological interpretation of the LDFs is to use the correlations of
the discriminators with the functions. The values of these correlations
depend upon the data matrices used. The LDF coefficients may be obtained
by using a "within-groups" formulation as reflected in equation [3], or
via a "total-group" formulation which, in essence, is a canonical correlation
attack on the problem (Tatsuoka, 1971, p. 177). The matrix product
used to get the (pxs) total-group (canonical) structure matrix is simply

[5]    $S_T = RV$,

where R is the (pxp) correlation matrix based on T (= E + H), and V
is the (pxs) matrix consisting of the s LDF coefficient vectors. The
structure matrix containing the within-groups correlations is given by

[6]    $S_W = D^{-1} C V^*$,

where $D^2 = [\mathrm{diag}\ C]$, C is the (pxp) within-groups covariance matrix
[C = E/(N-K)], and V* is the (pxs) matrix of s LDF standardized weight
vectors. As might be expected, the correlations determined by [5] will
be larger than corresponding correlations in [6]. In terms of labeling
the functions, the resulting interpretations based on [5] or [6] will be
the same. Such interpretations are, at best, a very crude approximation
to any identifiable psychological dimensions.
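The two kinds of variable-LDF correlations can be compared numerically. The sketch below does not evaluate the matrix products in [5] and [6] literally, since those depend on how the coefficient vectors are scaled; instead it computes each correlation directly as the covariance between a variable and a composite divided by the product of their standard deviations, once from the total-group covariance matrix and once from the within-groups covariance matrix. The data are simulated and the helper name `structure_matrix` is an assumption made for illustration.

```python
import numpy as np

def structure_matrix(S, V):
    """Correlations between p variables and s composites X @ V under covariance S."""
    S = np.asarray(S, dtype=float)
    V = np.asarray(V, dtype=float)
    cov_xy = S @ V                                # p x s covariances with the composites
    sd_x = np.sqrt(np.diag(S))                    # variable standard deviations
    sd_y = np.sqrt(np.sum(V * cov_xy, axis=0))    # sqrt of v' S v for each composite
    return cov_xy / np.outer(sd_x, sd_y)

# Hypothetical data: K = 3 groups of 20 individuals, p = 3 discriminators.
rng = np.random.default_rng(1)
groups = np.repeat([0, 1, 2], 20)
shifts = np.array([[0.0, 0.0, 0.0],
                   [1.5, 0.5, 0.0],
                   [0.0, 1.0, 2.0]])
X = rng.normal(size=(60, 3)) + shifts[groups]
N, K = len(X), 3

dev_total = X - X.mean(axis=0)
T = dev_total.T @ dev_total                       # total SSCP matrix, T = E + H
E = np.zeros((3, 3))
for g in range(K):
    dev = X[groups == g] - X[groups == g].mean(axis=0)
    E += dev.T @ dev
H = T - E

roots, vectors = np.linalg.eig(np.linalg.solve(E, H))
order = np.argsort(roots.real)[::-1][:K - 1]      # keep s = min(K-1, p) = 2 functions
V = vectors.real[:, order]

S_total = structure_matrix(T / (N - 1), V)        # total-group correlations
S_within = structure_matrix(E / (N - K), V)       # within-groups correlations
print(np.round(S_total, 3))
print(np.round(S_within, 3))
```

For the leading function the total-group correlations are at least as large in absolute value as the within-groups correlations, and the two sets agree in sign, which is in line with the remark that either set leads to the same labeling of the functions.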

Darlington, Weinberg and Walberg (1973) contend that the choice
between standardized weights and correlations for interpreting LDFs
ought to be based on the practical consideration of sampling error; they
argue that because of greater stability, correlations ought to be emphasized,
at least in some cases. In a Monte Carlo study, Huberty and Blommers
(1972) concluded that neither statistic, when based on the leading LDF, was
very stable in a cross-validation sense; this conclusion was not fully
supported by Thorndike and Weiss (1973). More will be said on this in the
section, "Generalizability."

There has been some attempt to achieve greater interpretability by
rotating LDFs. Tatsuoka (1973a, pp. 301-302) briefly reviews two studies
in which rotation was used; the matrix to be rotated in one study (Anderson,
Walberg, and Welch, 1969) consisted of the (total-group) variable-LDF
correlations, and in a second (McQuarrie and Grotelueschen, 1971) consisted
of standardized weights. It is questioned by the present writer whether
or not rotation of such canonical factors will, in most situations, be
of great help in interpretation. Tatsuoka (1971) states that rotation
(of a structure matrix, at least) "...requires further scrutiny and theoretical
justification... [p. 301]." The issue of oblique versus orthogonal
rotation of LDFs is a theoretical one yet to be resolved. Two general
methods of rotation are discussed by Hall (1969); one method which attempts
to arrive at an interpretable taxonomy of variables involves an orthonormal
rotation of a structure matrix.
This discussion of LDF interpretation may be closed with a caveat:
unless an investigation deals with variables having some common psychological
grounds, attempting to substantively interpret the LDF(s) may be wasted effort.
Thus if substantive interpretation of the function(s) is important to an
investigator, his initial choice of variables ought to be made carefully.

The problem of variable contribution to group separation in discriminant
analysis is a sticky one, as it is in multiple regression analysis (Darlington,
1968). It may be argued that the variables act in concert and cannot
logically be separated. As far as an index to measure the "importance" or
the size of the "effect" of a variable is concerned, no completely satisfactory
proposal has been made. Traditionally the statistic used to assess
the contribution of each variable (in the company of all others) has been
its standardized weight. The variable-LDF correlations discussed previously
in terms of substantive interpretation have also been suggested to order
variables in terms of their contributions to separation. In a discussion
which only involved the leading LDF, Bargmann (1970) argues that if
correlations for only some of the variables are large (in absolute value),
and small for others, then the former variables contribute essentially to
group separation. If the order in which variables are entered into the
analysis can be determined a priori, then the step-down procedure of
Roy and Bargmann (1958), discussed by Bock and Haggard (1968) and Stevens
(1973), can be used for testing the significance of the contribution of each
newly entered variable.

This section is closed with a proposal for a procedure of analyzing
data for the purpose of discrimination involving K (>2) groups. This
"hierarchical analysis," which is only appropriate when the grouping variable
is clearly categorical as opposed to being ordinal, may be described as
follows. First of all, the variables are screened (as discussed in the
next section); suppose that after this screening there are p variables
remaining. As in equation [3], the eigenvector, $v_1$, associated with the
largest eigenvalue of $E^{-1}H$ is determined. The correlations between each
discriminator and the linear composite of all p variables, $v_1'X$, are
then found (the first column of $S_W$ in [6]). An argument for using only
the first LDF is presented by Bargmann (1969, p. 573). The variables are
ordered on the basis of the absolute values of these correlations. The
ordered array is then examined for a "breaking point" (or possibly several,
if p is large) between large and small absolute values. If a disjunction
occurs, then the variables fall into two classes with respect to discrimination.
New (leading) LDFs for the two subsets of variables are calculated,
and the process is repeated until no new subsets can be generated. (At each
step variable-LDF correlations are determined and a substantive interpretation
may be attempted.) A hierarchy of sets of variables, based on
directly observable and, hence, interpretable measurements can be thus
established. This, it may be argued, is preferable to an interpretation
of residual discrimination (that associated with $v_2$, say) after the elimination
of an artificial variable.
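One step of the ordering-and-splitting part of this procedure might be sketched as follows. The paper leaves the identification of a "breaking point" to the judgment of the investigator; the rule used here (split at the largest gap between adjacent ordered absolute correlations, provided the gap is at least `min_gap`) is purely an assumption for illustration, as are the variable names, the correlations, and the threshold of 0.2.

```python
def split_at_breaking_point(correlations, min_gap=0.2):
    """Order variables by |correlation with the leading LDF| and split at the
    largest gap. Returns (high, low); low is empty if no gap reaches min_gap."""
    ordered = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
    sizes = [abs(correlations[name]) for name in ordered]
    gaps = [sizes[i] - sizes[i + 1] for i in range(len(sizes) - 1)]
    if not gaps or max(gaps) < min_gap:
        return ordered, []                 # no disjunction: the set is not split
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]

# Hypothetical correlations of five discriminators with the leading LDF.
correlations = {"x1": 0.82, "x2": -0.75, "x3": 0.21, "x4": -0.10, "x5": 0.68}
high, low = split_at_breaking_point(correlations)
print("large:", high)
print("small:", low)

# A set with no clear disjunction is left intact.
flat = {"a": 0.50, "b": 0.45, "c": 0.40}
print(split_at_breaking_point(flat))
```

In a full analysis a new leading LDF would be computed for each of the two subsets and the step repeated until no further split is found.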



Variable Selection 



The process of selecting variables in discriminant analysis, as in
any multivariate analysis, can be considered before or after the main
analysis. If Cochran's (1964) conclusions can be extended from the two-group
to the K-group case, the operation of discarding noncontributing
discriminators at the outset may be hazardous. However, many statisticians
suggest that unless a variable is "significant" in a univariate sense, it
is probably wasteful to include it in a multivariate analysis, even if it
correlates appreciably with good discriminators. Grizzle (1970) recommends
that variables which do not have a reasonable expectation of containing
information about group differences should not be included in the analysis;
this would prevent a loss of power. The argument presented is based on the
idea that the deletion of a non-significant variable does not change the
largest characteristic root ($\lambda$ from equation [2]) very much. To conclude,
preliminary to data collection, variables ought to be chosen judiciously
on the basis of theory and prior research (Tatsuoka, 1969, p. 743). Then,
following collection of data on the p chosen variables, p univariate analyses
are performed; those variables not yielding significance at a low probability
level are deleted prior to the multivariate analysis. A possibly
extreme situation is as follows. Assuming univariate ANOVAs are appropriate,
clearly if the "signal-to-noise" ratio (F-value) for a variable
is less than unity, eliminating the discriminator from further consideration
is the sensible thing to do.
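The extreme case just described, dropping any discriminator whose univariate F-value is below unity, can be sketched with the usual one-way ANOVA F ratio. The variable names and scores are hypothetical, and the helper names are ours; a stricter screen would compare each F with a tabled critical value rather than with 1.

```python
def one_way_f(samples):
    """One-way ANOVA F ratio; samples is a list of score lists, one per group."""
    K = len(samples)
    N = sum(len(group) for group in samples)
    grand_mean = sum(sum(group) for group in samples) / N
    means = [sum(group) / len(group) for group in samples]
    ss_between = sum(len(group) * (mean - grand_mean) ** 2
                     for group, mean in zip(samples, means))
    ss_within = sum((x - mean) ** 2
                    for group, mean in zip(samples, means) for x in group)
    return (ss_between / (K - 1)) / (ss_within / (N - K))

def screen_variables(data, minimum_f=1.0):
    """Keep the variables whose univariate F ratio is at least minimum_f."""
    return [name for name, samples in data.items() if one_way_f(samples) >= minimum_f]

# Hypothetical scores for K = 3 groups on two candidate discriminators.
data = {
    "A": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],   # group means differ
    "B": [[1, 5, 9], [2, 5, 8], [3, 5, 7]],   # identical group means
}
f_values = {name: one_way_f(samples) for name, samples in data.items()}
kept = screen_variables(data)
print(f_values)
print("retained:", kept)
```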

The problem of variable selection or deletion may also be of interest
after the initial multivariate analysis has been carried out. In many
situations involving discrimination the investigator is presented with more
discriminator variables than he would like and there arises the question
of whether they are all necessary and, if not, which of them can be discarded.
That is, having obtained the linear composite, the investigator
may ask if the data might not have been adequately explained by using a
subset of the original p discriminators. The objective is to include as
many variables as possible so that reliable results may be obtained, and
yet as few as possible so as to keep the costs of acquiring data at a
minimum. Reasons for reducing the number of discriminators may be summarized
as follows [see Horst (1941) for elaboration]: (1) to
obtain fundamental and generally applicable variables, (2) to avoid
prohibitive labor, and (3) to increase the sampling stability of the
LDF(s). On the last reason Horst mentions that "as the ratio of the number
of discriminators to the number of individuals increases, there is a
tendency for the accuracy of (discrimination) to decrease if the weights
determined on the first sample are applied to a second group [p. 102]."

There is a dearth of literature covering the problem of variable
selection or reduction in multiple-group discriminant analysis. No reasonably
optimal procedure has yet been developed for discarding variables;
reasonable in the sense of amount of calculation, and optimum in the sense
that the selected variables would yield the maximum amount of separation
among the groups for that number of variables. Of course, one could consider
all possible subsets of the original p variables, but, just as in
multiple regression analysis, this is very expensive. Six "selection"
procedures were reviewed by Huberty (1971a). The objective of one procedure
is to obtain a subset of variables that may be considered representative
of the complete set (Bargmann, 1962a). Representativeness is based on
a (maximum likelihood) factor analysis of the discriminator within-groups
intercorrelation matrix, followed by an oblique rotation of the resulting
factors. The usual eigenanalysis (see equation [3]) is then performed on
the variables that define each factor. Correlations between each of these
variables and the first (or leading) LDF are determined; i.e., the first
column of $S_W$ in equation [6]. Variables are then selected that load on
each factor and correlate highly with the respective leading LDF. Another
procedure, suggested by Horst (1965, p. 555), involves a principal component
analysis of the discriminator within-groups intercorrelation matrix, followed
by an orthogonal rotation of the resulting components. A subset of variables
is selected such that each component will be adequately represented in the
subset. Variables are selected which have the highest loadings (in absolute
value) on each of the components; presumably, no variables are selected which
have high loadings on more than one component. The other four procedures
discussed yield some type of ordering of the predictors; they are based
on (1) standardized weights for the first LDF, (2) univariate F-ratios,
(3) discriminator versus (first) LDF correlations, and (4) an ordering
provided by the BMD 7M stepwise program (Dixon, 1973).

An empirical comparison of the six procedures using two sets of data
(K = 3 and p = 13 for one set; K = 5 and p = 17 for the other) was made using
the criterion of the proportion of correct reclassifications of individuals
across all procedures for subsets of a given size. It was found that the
stepwise procedure yielded the best subsets in the sense of most accurate
classification. It should be pointed out that an objective of the procedures
involving a dimension (i.e., factor or component) analysis is to select a
subset that is representative of the total set. Representativeness and
retention of nearly the same discriminatory power as the original set will
not necessarily characterize the selected subset simultaneously.

As suggested in the preceding section of the present paper, standardized
weights associated with a given LDF (first or otherwise) may be used
to assess the contribution of each variable (in the company of all others)
to the separation accomplished by that LDF. This method of assessment may
be extended to obtain a measure of the relative contribution of each
variable to the total set of LDFs. The sets of weights are weighted by a
function of the proportion of discriminatory power accounted for by each LDF.
The measures to be used are given by the p x 1 vector

[7]   $a = V^{*} \lambda$ ,

where $V^{*}$ is the p x s matrix defined in [6], and $\lambda$ is the s x 1 vector
of eigenvalues of $E^{-1}H$. Another proposal is made for ordering variables,
with respect to group separation, on the basis of the variable versus LDF
correlations. The measures to be considered are somewhat analogous to
"communalities" in factor analysis. These are given in the p x 1 vector

[8]   $b = \mathrm{diag}(SS')$ ,

where S is the p x s structure matrix defined in [5] or [6]. The jth element
in b is the sum of squares of the "loadings" in the jth row of S.
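
The two ordering measures may be illustrated numerically. The following is a minimal sketch, assuming the standardized weights, the eigenvalues, and the structure matrix have already been obtained; the function name and the numerical values are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def contribution_measures(V_star, eigvals, S):
    """Return (a, b) for p variables and s LDFs.

    V_star : (p, s) standardized LDF weights
    eigvals: (s,)   eigenvalues of E^{-1} H
    S      : (p, s) variable-versus-LDF correlations (structure matrix)
    """
    V_star = np.asarray(V_star, dtype=float)
    eigvals = np.asarray(eigvals, dtype=float)
    S = np.asarray(S, dtype=float)
    a = V_star @ eigvals          # weights combined across LDFs by eigenvalue
    b = np.sum(S ** 2, axis=1)    # row sums of squared loadings, diag(S S')
    return a, b

# Hypothetical values for p = 3 variables and s = 2 LDFs.
V_star = [[0.9, 0.1], [0.2, -0.8], [-0.4, 0.3]]
eigvals = [2.0, 0.5]
S = [[0.8, 0.2], [0.3, -0.7], [-0.5, 0.1]]
a, b = contribution_measures(V_star, eigvals, S)
order_by_b = np.argsort(-b)       # variables ranked by the second measure
```

Note that the elements of a may be negative, since the sign of each weight depends on the orientation of its LDF; whether absolute values should be taken before ranking is not settled by the definition and is left open here.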

Various selection schemes need to be researched further. The need exists
for empirical studies of other stepwise procedures; e.g., that proposed by
Dempster (1963), which is a forward stepwise procedure with the variable
ordering determined by a principal component analysis. Hall (1967) has
proposed a forward procedure involving multivariate analysis of covariance
(MANCOVA) which in essence is the same as that used in the BMD stepwise
program; the variables already in the analysis are the variates while the
remaining variables are the covariates. A variation of this use of MANCOVA
to select the most effective discriminators is given in a study by Horton,
Russell, and Moore (1968). See also Smith et al. (1972). Hotelling's trace
statistic was used as a criterion for selecting variables in a forward
manner by Miller (1962). There is some argument for using a "backward"
scheme, where variables are deleted from, rather than added to, the
analysis [see Mantel (1970)]. It would also be of interest to study the
appropriateness of the measures in [7] and/or [8] as indicators of variable
contribution.

Eisenbeis and Avery (1972) suggest a variable selection method which
incorporates a number of techniques used jointly. Such a combination method
may be described as follows. Determine variable orderings based on a number
of different techniques (e.g., stepwise, standardized weights,
variable-LDF correlations, backward elimination). To determine an upper
bound on the number of variables to be retained, a minimum arbitrary
acceptance level of the reduction in discriminatory power using a set of
size q instead of all p variables is set. Eisenbeis and Avery support the
one percent significance level of a MANCOVA F-statistic as a criterion.
Other criteria, such as a significance level of Hotelling's trace
statistic or a specified proportion of correct classifications yielded by
the set of entered variables, may also be used. This may give t (the number
of techniques used) different subsets of size q. Two different approaches
may now be taken to arrive at a single subset. One approach, mentioned by
Eisenbeis and Avery (1972, p. 82), is to determine those of the q variables
and of the p-q variables that are common across the different techniques.
The latter variables, say m in number, are discarded from further
consideration, and the former, say n in number, are to be included in the
final subset of size q. To obtain the final subset of size q, then, the
best q-n are selected out of the p - (m+n) questionable variables by
considering all possible subsets of size q-n. A second approach, similar to
that suggested by Draper and Smith (1966, p. 172) for use in multiple
regression, is to consider all possible subsets of size q from the p
original variables.
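
The first approach may be sketched in code. This is an illustration only: the function name, the variable labels, and in particular the additive toy criterion passed as `score` are hypothetical stand-ins; in practice the criterion would be some measure of discriminatory power or classification accuracy computed on the candidate subset.

```python
from itertools import combinations

def combine_orderings(selected_sets, all_vars, q, score):
    """Pick one subset of size q from t technique-specific subsets.

    selected_sets: list of sets, each holding the q variables one technique kept
    all_vars     : iterable of all p variable names
    score        : callable(subset) -> float, larger is better (hypothetical)
    """
    all_vars = set(all_vars)
    # Variables every technique retained (n of them).
    always_in = set.intersection(*selected_sets)
    # Variables every technique discarded (m of them).
    always_out = set.intersection(*[all_vars - s for s in selected_sets])
    # The remaining p - (m + n) variables are the questionable ones.
    questionable = sorted(all_vars - always_in - always_out)
    best = max(
        combinations(questionable, q - len(always_in)),
        key=lambda extra: score(always_in | set(extra)),
    )
    return always_in | set(best), always_in, always_out

# Hypothetical example: p = 6 variables, q = 3, t = 3 techniques.
all_vars = ["x1", "x2", "x3", "x4", "x5", "x6"]
selected_sets = [{"x1", "x2", "x3"}, {"x1", "x2", "x4"}, {"x1", "x3", "x4"}]
toy_weight = {"x1": 5, "x2": 3, "x3": 4, "x4": 1, "x5": 0, "x6": 0}
chosen, kept, dropped = combine_orderings(
    selected_sets, all_vars, q=3,
    score=lambda subset: sum(toy_weight[v] for v in subset),
)
```

Only subsets of the questionable variables are enumerated, which is what makes this approach cheaper than examining all subsets of size q from the p original variables.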

Finally, it is noted that after a subset of variables has been
selected it is desirable to reanalyze the data only on the selected
variables so as to assess their relative contribution. This is particularly
true when the assessment is based on standardized weights; the rank-order
of the selected variables as a set by themselves may be different from
their rank-order when considered in the company of all the original
discriminators.

Generalizability

Even though discrimination involves basically exploratory techniques,
very often its users attempt to generalize results to other sets of subjects,
other variables, or other situations. Generalizability may be thought of
in terms of statements of inferences from sample results to some population,
and in terms of stability of the obtained results over repeated sampling.
Mulaik (1972) emphasizes the caution with which one proceeds in making
inferences when treating LDFs as factors. One warning is that with the
formulation of [3], the LDFs obtained pertain to the variables after
variance in them due to group differences has been removed from them.
Thus, such dimensions do not reflect variance which exists in the variables
on which the groups differ and "...may in some contexts give misleading
characterization of the nature of the discriminant functions [p. 428]."

Not much conclusive evidence has been found regarding the stability
of results in discrimination studies. In a Monte Carlo investigation
designed to study the comparative stability of standardized weights and
variable-LDF correlations, Huberty and Blommers (1972) found that neither
index held up to any great extent under repeated sampling. (It should be
noted that only the leading LDF was considered in that study.) Two sets of
live data were used in a study by Thorndike and Weiss (1973), who
concluded that if an investigator "uses a single sample and attempts
to interpret the canonical components (LDFs) he may be interpreting
nothing more than sample-specific covariation [p. 131]." They did conclude,
however, that component loadings (variable-LDF correlations) are consistent
in cross-validation and in this sense are more stable and more useful
than standardized weights. Stevens and Barcikowski (1974) concluded from
a Monte Carlo study that in some situations (depending upon variable
intercorrelations) standardized weights are more stable than
variable-canonical variate correlations, and in other situations the
reverse holds.

Problems of generalizability due to instability of some results
appear to point to the need for replication of studies and cross-validation
of findings. Of course, the use of simple (or double) cross-validation
techniques calls for relatively large samples. [Tatsuoka (1970, p. 38)
suggests that in a usual discriminant analysis the size of the smallest
group be no less than the number of variables used, p. This may be a bit
conservative.] To use cross-validation techniques it is recommended that
the smallest n-value be at least as large as 3p. Then in the
cross-validation process, a random one-third of the total number of
observations may be withheld from each group to serve as a "holdout
sample." Horst (1966) points out the dilemma into which one is placed when
using cross-validation techniques: "If we develop a procedure and then
cross-validate it, we have ipso facto not developed the best procedure
possible from the available data [p. 140]."
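
The holdout recommendation may be sketched as follows. This is a minimal illustration, assuming each group's observations are held in a list; the function name, the seed, and the example data are hypothetical.

```python
import random

def holdout_split(groups, p, seed=0):
    """Withhold a random one-third of each group as a holdout sample.

    groups: dict mapping group label -> list of observations
    p     : number of discriminator variables
    """
    smallest = min(len(obs) for obs in groups.values())
    if smallest < 3 * p:
        raise ValueError("smallest group has fewer than 3p observations")
    rng = random.Random(seed)
    train, holdout = {}, {}
    for label, obs in groups.items():
        shuffled = list(obs)
        rng.shuffle(shuffled)
        n_hold = len(shuffled) // 3          # one-third, rounded down
        holdout[label] = shuffled[:n_hold]
        train[label] = shuffled[n_hold:]
    return train, holdout

# Hypothetical data: two groups of row indices, p = 4 variables.
groups = {"A": list(range(12)), "B": list(range(100, 130))}
train, holdout = holdout_split(groups, p=4)
```

The classification rule would then be developed on `train` and its accuracy judged on `holdout`, which is precisely the situation to which Horst's dilemma refers.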

Specific Uses of LDFs
The use of LDFs as an aid in the interpretation of MANOVA results
was mentioned in the "Introduction" of this paper as a major breakthrough
in multivariate analysis. Uses of LDFs in factorial MANOVA are illustrated
by Jones (1966) and Saupe (1965), who point out that multiple LDFs are
useful in interpreting the source of significant interaction effects as well
as the source of significant main-effect differences. Both of these writers
base their interpretations of LDFs on standardized weights in preference
to variable-LDF correlations. Some writers (e.g., Timm, 1974) prefer the
use of simultaneous test procedures (Gabriel, 1968) for studying significant
differences, while others (e.g., Tatsuoka, 1973a) prefer the LDF approach.

Tatsuoka (1973a, p. 284) also suggests that LDFs may be helpful in
deciding when to terminate a clustering procedure such as that of Ward
(1963). At each stage of the analysis the LDFs based on the clusters (of
individuals) determined to that point can be examined for interpretability.
Discrimination procedures were used by Rock, Baird, and Linn (1972) as
a follow-up to a cluster analysis involving areas of study of college
students. In addition to univariate F-values, discriminator-LDF
correlations were considered in assessing relative contribution of the
variables to the obtained first LDF.

Techniques of discrimination have been shown to be of help in the
study of pattern recognition. The research of Kundert (1972) illustrates
the use of an LDF in assigning scale values to categories of a response
variable, irrespective of the manner in which the categories may be ordered.

Discrimination in Two-Group Case
The relationship between multiple-group discriminant analysis and
canonical correlation was pointed out previously. The lower level
relationship between two-group discriminant analysis and multiple
correlation has been the subject of many writings. The proportionality of
the raw score coefficients for the two analyses was shown by Michael and
Perry (1956). More recently this proof has been vastly simplified via the
use of matrix notation (Healy, 1965; Porebski, 1966a; Cramer, 1967;
Tatsuoka, 1971, pp. 171-173). Because of this relationship, many of the
methods used in interpreting a regression analysis are applicable in
two-group discriminant analysis. Collier (1963) showed that tests used in
deleting variables in regression analysis and in discriminant analysis are
equivalent, while Huberty (1972b) showed that predictor variables may be
equivalently ordered (with respect to contribution to separation) by
univariate F-ratios and by within-groups variable versus LDF correlations.
Cochran (1964), Weiner and Dunn (1966), and Urbakh (1971) have also studied
the problem of eliminating variables in the two-group case.
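
The proportionality of the two-group LDF coefficients and the multiple regression coefficients may be checked numerically. The sketch below uses simulated data (hypothetical, generated only for this check) and a 1/0 dummy criterion for group membership; the coding of the dummy is an assumption, and any two distinct codes would serve.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n1, n2 = 3, 40, 60
# Simulated data (hypothetical): two groups sharing one covariance matrix.
X1 = rng.normal(size=(n1, p)) + np.array([1.0, 0.5, 0.0])
X2 = rng.normal(size=(n2, p))

# Two-group LDF raw-score coefficients: S^{-1} (mean1 - mean2).
dev1, dev2 = X1 - X1.mean(0), X2 - X2.mean(0)
E = dev1.T @ dev1 + dev2.T @ dev2            # pooled within-groups SSCP
S = E / (n1 + n2 - 2)
ldf = np.linalg.solve(S, X1.mean(0) - X2.mean(0))

# Multiple regression of a 1/0 group-membership dummy on the same variables.
X = np.vstack([X1, X2])
y = np.concatenate([np.ones(n1), np.zeros(n2)])
design = np.column_stack([np.ones(n1 + n2), X])
beta = np.linalg.lstsq(design, y, rcond=None)[0][1:]   # drop the intercept

ratio = ldf / beta   # one constant across variables if the two are proportional
```

Every element of `ratio` is the same positive constant, which is why rankings and deletion tests carry over between the two analyses.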

In studying discrimination between two groups of foreign graduate
students in business administration, Grimsley and Summers (1965) applied
tests of significance of the LDF coefficients [see Kendall (1968, p. 163)]
in determining the most effective combination of discriminators to
differentiate between success and failure groups. Recently, Eisenbeis,
Gilbert, and Avery (1973) studied the problem of variable assessment and
selection in the context of a specific empirical problem. They concluded
that the various selection methods studied "...could yield radically
different inferences about the relative power of individual variables
[p. 218]." It was also concluded that "...the assessment of the relative
performance of the different subsets also varies depending upon whether the
goal is to select the subset that maximizes differences between group means
or to choose the combination of variables that yields the best
classification results [p. 218]." Huberty (1974b) has shown the formal
equivalence between a test for deleting variables which is based on
distance and a test based on MANCOVA for the two-group case. Implications
of this equivalence for interpretation of results of multi-group analyses
were discussed in the section, "Variable Selection."

Discrimination Research Applications
No attempt will be made to review all studies in behavioral research
that incorporate discrimination procedures. Rather, selected journal articles
will be cited so as to acquaint the reader with (1) some research situations
(i.e., types of subjects, criterion groups, and discriminators) for which
discrimination may be helpful, and (2) the discrimination techniques being
used. No critique of the substantive discussions and conclusions presented
in the articles will be attempted. The articles reviewed are in addition to
those discussed earlier in this paper and all appeared in 1968 or later.
Huberty (1969) cites 30 studies reported from 1963 to 1968 in which
discriminant analysis techniques were used.

The entire focus of one study was a two-group simple MANOVA, although
the analysis technique was described as a "multiple discriminant analysis."
Williams (1972) used six factor scores on a semantic differential for
differing socioeconomic status groups (low versus middle) of 181 fifth grade
inner city school children. Another study (Maw and Magoon, 1971) involved
a 2x2 MANOVA design with sex and curiosity as the grouping variables.
Twenty-six measures on affective, cognitive, personality, and social trait
variables were obtained on the four groups of middle class white
fifth-graders. Since the sex-by-curiosity interaction was not significant,
the associated LDF was not considered for interpretation. The variable-LDF
correlations (the type was not specified explicitly) were used in
interpreting the sex and the curiosity LDFs. Canonical correlations for sex
and curiosity, each versus the 26 variable composite, were also examined.

Two studies were found in which the LDF interpretation was based on
standardized weights. Project TALENT data were used by Schoenfeldt (1968)
in a study involving a random selection of about 300 students in each of
K post high school education groups. Measures on 79 variables were
available; after preliminary screening, 26 were selected for use. A
64-item study strategy questionnaire was used by Goldman and Warren (1973)
to obtain data on 538 university students who were in four different
undergraduate major areas. In the study, the weights were used both to
assess relative contribution to separation and to give meaningful
interpretation to the resulting LDFs. Two-dimensional plots of group
centroids were used in both of these studies as an aid to interpretation.
A third study which used discriminant "weights" was reported by McNeil
(1968). What weights these were was not made explicit. Here, 521 sixth
grade children in four subcultural groups were considered for separation by
six factors which resulted from a "factor analysis" of 20 semantic
differential scales.

Discriminator-LDF correlations were utilized in two studies for purposes
of LDF interpretations. Field et al. (1971) obtained measures on 57
undergraduate Australian students using an 18-item questionnaire assessing
teaching behavior. These data were examined to evaluate discrimination
among six teachers, including one "ideal" teacher. Substantive
interpretations of the LDFs were made. Bausell and Magoon (1972) used data
on 29 items of the Purdue Rating Scale for Instruction for approximately
2000 sophomore to senior university students for purposes of
differentiating four criterion groups defined by student expected grades.
Total-group variable-LDF correlations were used to substantively interpret
the LDFs, as well as to assess the "importance" of the discriminators. They
also incorporated the statistic, 1 - $\Lambda$, for interpretive purposes. See
also Whellams (1973).

The discriminant analysis techniques used by Chapin (1970) were not
clear. In his study of four groups of mathematics teachers (determined by
principal ratings) on whom personal and academic measures were obtained, he
states, "After extensive sorting through 40 variables, it was found that only
a few variables contributed significantly to the discriminant analysis
[p. 161]."

The two remaining studies to be briefly reviewed use a large number of
criterion groups. Baggaley, Isard, and Sherwood (1970) used 17 groups
(14 academic, three "miscellaneous") of university juniors; ten personality
measures were obtained on each of 628 students. These investigators
examined "normalized" vector coefficients for relative variable
contribution, and to meaningfully interpret the LDFs. A two-dimensional (why
two?) plot was given. A set of 26 personal and academic measures for
college undergraduates was used by Burnham and Hewitt (1972) to differentiate
among 16 occupational groups in a follow-up study. Univariate Student
t-values were considered to assess relative contribution of the
discriminators.

From a statistical point of view, criticisms of the methodology
used or of the reporting in some of these studies are possible. It
is recognized, however, that writers and/or editors may have reasons for
not including all the details of the techniques used.

Estimation

To restate, estimation is that aspect of discriminant analysis that
pertains to characterizing inter-group distance and strength of relationship.

Measures of Distance 
About 1920 K. Pearson proposed his coefficient of racial likeness
(CRL) as a measure of distance which was subsequently used mostly in
craniology. In the middle to late 1920's G. M. Morant suggested a corrective
factor to be applied to the CRL to offset effects due to varying sample
sizes; at about the same time, P. C. Mahalanobis proposed a Euclidean
distance measure. [See Hodges (1950, pp. 5-25) for a more complete development
of the history of distance measures.] The distance between two population
centroids may be expressed as

$\Delta = [(\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 - \mu_2)]^{1/2}$ ,

where $\mu_k$ is the centroid of population k, and $\Sigma$ is the covariance matrix
common to the two populations. The quantity $\Delta$ has become known as
Mahalanobis' generalized distance, and the square of the sample distance,

[9]   $D^2 = (\bar{X}_1 - \bar{X}_2)' S^{-1} (\bar{X}_1 - \bar{X}_2)$ ,

is often referred to as Mahalanobis' $D^2$ statistic. The (p x p) matrix S in
[9] is defined by $(N_1 + N_2 - 2) S = E$; $\bar{X}_k$ is the centroid of group k.

Although S is an unbiased estimator of $\Sigma$, it should be noted that $D^2$
is not an unbiased estimator of $\Delta^2$ (Rao, 1949). Since an unbiased
estimator of $\Delta^2$ often results in negative estimates of the square, its
use is discouraged. In presenting a logical derivation of $D^2$ as a distance
measure, Rao (1952) points out that $D^2$ "...is applicable only to groups
in which the measurements are normally distributed [p. 355]."

If significant group separation is found, it is possible to gain some
insight regarding group differences by simply calculating the Euclidean
distance (as in [9]) between all pairs of centroids. If, for example,
distances between all pairs of K-1 of the groups are small, yet at the same
time the kth group is distinctly separated from the other K-1 groups, it is
clear that the only separation taking place occurs between the kth group and
its complement, i.e., the other K-1 groups. These pairwise group distances
are often given in the output of computer programs, e.g., the BMD 7M program
(Dixon, 1973).
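
A table of pairwise centroid distances of this kind may be computed as sketched below. The sketch assumes that the pooled within-groups covariance matrix from all K groups is used for every pair, which is one common convention and not necessarily that of any particular program; the function name and the simulated data are hypothetical.

```python
import numpy as np

def pairwise_d2(groups):
    """Squared generalized distances between all pairs of group centroids.

    groups: list of (N_k, p) arrays, one per group. The pooled within-groups
    covariance S = E / (N - K) is used for every pair (an assumption here).
    """
    means = [g.mean(axis=0) for g in groups]
    E = sum((g - m).T @ (g - m) for g, m in zip(groups, means))
    N, K = sum(len(g) for g in groups), len(groups)
    S_inv = np.linalg.inv(E / (N - K))
    d2 = np.zeros((K, K))
    for j in range(K):
        for k in range(K):
            diff = means[j] - means[k]
            d2[j, k] = diff @ S_inv @ diff
    return d2

rng = np.random.default_rng(1)
# Hypothetical data: groups 0 and 1 overlap, group 2 is shifted away.
groups = [rng.normal(size=(30, 2)),
          rng.normal(size=(30, 2)) + 0.2,
          rng.normal(size=(30, 2)) + 4.0]
d2 = pairwise_d2(groups)
```

In this constructed example the entries involving group 2 dominate the table, the pattern described above in which one group is separated from its complement.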

It may be noted in passing that a transformation of each $D^2$ statistic
may be used as a test statistic in the two-group case. This transformation
[see Rulon and Brooks (1968, p. 69)] may be considered as an alternative to
Hotelling's $T^2$ or Wilks' $\Lambda$ statistics.

As will be apparent later, the distance function $D^2$, or a variation
thereof, appears in most multivariate classification schemes. That is, a
measure of distance between an individual's data point, $X_i$, and the kth
group centroid is of interest. This measure, assuming a common population
covariance matrix, $\Sigma$, is given by

$\Delta_{ik} = [(X_i - \mu_k)' \Sigma^{-1} (X_i - \mu_k)]^{1/2}$ ,

and the sample distance is given by

[10]   $D_{ik} = [(X_i - \bar{X}_k)' S^{-1} (X_i - \bar{X}_k)]^{1/2}$ .

Rao (1952, p. 257) gives a generalization of Mahalanobis' $D^2$ statistic
(labeled "V" by Rao):

$V = \sum_{k=1}^{K} N_k (\bar{X}_k - \bar{X})' S^{-1} (\bar{X}_k - \bar{X})$ ,

where $\bar{X}$ is the (p x 1) vector of predictor means across all K groups. It
can be shown that V can be used as a chi-square statistic with p(K-1) degrees
of freedom to test the hypothesis of equality of the K population mean
vectors. As Rao points out, the V statistic may be partitioned into
independent chi-square statistics so as to judge the significance of
information lost when some variables are deleted. This criterion is
equivalent to Hotelling's trace statistic, the use of which was proposed by
Miller (1962) for variable selection [see Friedman and Rubin (1967,
p. 1162)]. The value of V is part of the output of the BMD 5M program
(Dixon, 1973). It turns out that when K = 2, $V = D^2 \cdot N_1 N_2 / (N_1 + N_2)$.
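
The two-group identity for Rao's generalized statistic may be verified numerically. The sketch below assumes the usual form of the statistic, a sum over groups of the size-weighted squared distance of each centroid from the grand centroid, with the pooled covariance matrix S = E/(N - K); the function name and the simulated data are hypothetical.

```python
import numpy as np

def rao_v(groups):
    """Sum over groups of N_k (mean_k - grand)' S^{-1} (mean_k - grand)."""
    means = [g.mean(axis=0) for g in groups]
    sizes = [len(g) for g in groups]
    N, K = sum(sizes), len(groups)
    grand = sum(n * m for n, m in zip(sizes, means)) / N
    E = sum((g - m).T @ (g - m) for g, m in zip(groups, means))
    S_inv = np.linalg.inv(E / (N - K))
    return sum(n * (m - grand) @ S_inv @ (m - grand)
               for n, m in zip(sizes, means))

rng = np.random.default_rng(2)
g1 = rng.normal(size=(25, 3)) + 1.0   # hypothetical group 1
g2 = rng.normal(size=(40, 3))         # hypothetical group 2

V = rao_v([g1, g2])

# Two-group squared distance computed directly from the pooled covariance.
dev1, dev2 = g1 - g1.mean(0), g2 - g2.mean(0)
E = dev1.T @ dev1 + dev2.T @ dev2
diff = g1.mean(0) - g2.mean(0)
D2 = diff @ np.linalg.solve(E / (25 + 40 - 2), diff)
```

With these definitions V agrees with D2 multiplied by 25(40)/(25 + 40), up to rounding error.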

Measures of Discriminatory Power 
As most researchers who have toyed with univariate measures of
association know, the numerical values of most proposed indices for a given
set of data are nearly the same. This has also been shown to be the case in
the multivariate situation by Stevens (1972a) and Huberty (1972a).
Multivariate measures of strength of relationship, or of discriminatory
power, that have been proposed are: (1) $1 - \Lambda$ [Wilks' statistic],
(2) $U/(1+U)$, where $U = \mathrm{tr}(E^{-1}H)$, (3) $U'/(1+U')$, where
$U' = (N-p-1)U/N$, and (4) an extension of Hays' (1973) omega squared.
Tatsuoka (1973b) has studied the properties of this last measure, which he
proposed earlier (Tatsuoka, 1970) and which may be expressed as

$\hat{\omega}^2_{mult} = 1 - \dfrac{N}{(N-K)\Lambda^{-1} + 1}$ .

The intent of Tatsuoka's more recent study was to develop an unbiased
estimate of

[11]   $\omega^2_{mult} = 1 - \dfrac{|\Sigma|}{|\Sigma + \frac{1}{K} A A'|}$ ,

where $\Sigma$ is the (p x p) covariance matrix common to the K populations, and A
is a (p x K) matrix of effect parameters; the (j,k)th element of A is the
deviation of the kth population mean from the general mean for the jth
variable. (Formula [11] expresses the proportion of generalized variance
of the p variables attributable to differences among centroids.) Since
various attempts to develop such an estimate were of no avail, an attempt
was made to develop a formula for correcting the (positive) bias in
$\hat{\omega}^2_{mult}$. Empirical results led to a "rule-of-thumb" correction to be
used with small samples:

[12]   $\hat{\omega}^2_{corr} = \hat{\omega}^2_{mult} - \dfrac{p^2 + (K-1)^2}{3N} (1 - \hat{\omega}^2_{mult})$ .

It was found that $\hat{\omega}^2_{mult}$ itself can be used when
$\hat{\omega}^2_{mult} < .30$ and N/p > 100, or when $\hat{\omega}^2_{mult} > .50$ and
N/p > 50; however, neither situation is typical of that found in
educational research. Formula [12] was deemed to be adequate,
at least when $p(K-1) \le 49$ and $75 \le N \le 2000$.
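
The indices listed in this section may be computed from the error and hypothesis sums-of-squares-and-cross-products matrices, as in the following sketch. The expressions used for Tatsuoka's estimator and for the rule-of-thumb correction are the forms in which these are commonly cited and should be treated as assumptions here; the function name and the matrices E and H are hypothetical.

```python
import numpy as np

def power_indices(E, H, N, K):
    """Indices of discriminatory power from the error (E) and hypothesis (H)
    SSCP matrices of a K-group design with N observations and p variables."""
    p = E.shape[0]
    wilks = np.linalg.det(E) / np.linalg.det(E + H)      # Wilks' Lambda
    U = np.trace(np.linalg.solve(E, H))                  # tr(E^{-1} H)
    U_adj = (N - p - 1) * U / N
    # Assumed form of Tatsuoka's estimator: 1 - N / ((N - K) / Lambda + 1).
    omega = 1 - N / ((N - K) / wilks + 1)
    # Assumed form of the small-sample rule-of-thumb correction.
    omega_corr = omega - (p**2 + (K - 1)**2) / (3 * N) * (1 - omega)
    return {"1-Lambda": 1 - wilks,
            "U/(1+U)": U / (1 + U),
            "U'/(1+U')": U_adj / (1 + U_adj),
            "omega2_mult": omega,
            "omega2_corr": omega_corr}

# Hypothetical SSCP matrices for p = 2, K = 3, N = 60.
E = np.array([[50.0, 10.0], [10.0, 40.0]])
H = np.array([[20.0, 5.0], [5.0, 8.0]])
idx = power_indices(E, H, N=60, K=3)
```

For these made-up matrices the corrected value falls below the uncorrected one, which in turn falls below 1 - Lambda, consistent with a correction for positive bias.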

Another index of discriminatory power which has received some attention
is the proportion of correct classifications across all K groups (Cooley and
Lohnes, 1971, p. 329). In a multiple discriminant situation this indicator of
strength of predictive validity may be more appropriate than a correlative
measure. This statistic will be discussed in some detail later.



Classification

As noted previously, the original intended purpose of "discriminant
analysis" and the linear discriminant function (LDF) was that of
classification. Given a sample of individuals (or objects) from each of two
or more populations, we want to construct a method of assigning a new
individual to the correct population of origin on the basis of measures on p
variables. Classification procedures are used to solve prediction problems;
given measures on the p predictors it is of interest to predict membership
in one of the natural or preexisting groups. A more formal view is that
classification is used to answer the question: Given an individual with
certain measurements, from which population did he emanate? In this sense,
the problem of classification may be considered a problem of "statistical
decision functions" (Anderson, 1958, p. 126). The p predictors are the
independent variables and the single criterion is the grouping variable,
the latter being, of course, nonmetric. Illustrations of the use of
multivariate classification in behavioral research are given later.

Requisite Information 

Various methods of multivariate classification have been proposed. 
The use of these methods presupposes that the user has knowledge of certain
information. This information may be in the form of: (1) the density
function which best describes the data on hand, (2) restrictions on data 
conditions necessary to select the most appropriate method, (3) prior 
probabilities of group membership, and (4) misclassification costs. 

The early work in discriminant analysis, specifically that on LDFs and
generalized distance, was based on multivariate normal probability
distributions. The general theory of classification is not, however,
dependent upon multivariate normality. General distribution-based rules of
assigning individuals to populations (so that probabilities of
misclassification are minimized) are discussed by Anderson (1958,
pp. 142-147) and Overall and Klett (1972, Ch. 12). Parametric and
nonparametric density estimators were used by Glick (1972) to estimate the
non-error rate of an arbitrary classification rule and to construct a rule
which maximized estimated probability of correct classification. Insofar as
could be determined, little work has been done with classification rules
involving continuous predictors other than those based on normality.

Assuming multivariate normality, a linear classification rule may
be used when it can be further assumed that the condition of equal
covariance structures across the K groups is met. As will be shown later,
differences in covariances, as well as differences in means, can be utilized
in making predictions about group membership, and in estimating error rates.

A helpful consideration to be taken when confronted with the problem
of classification is that of prior probabilities of group membership. Such
a probability is that of drawing at random an individual of each group
from the total population of all K groups. Taking the approach of
frequentists, these priors are relative frequencies of individuals of each
of the K populations in the total population. From sample data, then,
these probabilities are estimated by $N_k/N$, k = 1, ..., K. [Tatsuoka (1971,
pp. 225-226) discusses problems in using such estimates.] Priors may
also be estimated by using Markov models as suggested by Lohnes and
Gribbons (1970); alternatives to these suggestions were considered in
an empirical study by Lissitz and Henschke-Mason (1972).

Typically, in educational research differential costs of
misclassifying individuals into the K groups are ignored, in the sense
that equal costs are assumed. This is not too surprising since
quantification of costs of misclassification in educational research may be
difficult, even though only relative costs are important.

Classification Rules 

Various parametric and nonparametric rules of classification have been
proposed. In most of these rules some notion of "distance" comes into play;
that is, an individual is assigned to that group whose centroid is closest
to the data-point representing him. "Closeness" is measured by a
probabilistic notion of "distance," as opposed to the geometric Euclidean
distance measure discussed in an earlier section. The use of the LDF for
classification purposes in the two-group situation was initially based on
simple Euclidean distance: assuming multivariate normality and equal
covariance structure, an individual was assigned to the group with the mean
discriminant score nearer to his discriminant score. Fisher's LDF as a
classification statistic was not at first considered in reference to a
probabilistic model. The relationship of posterior probability of group
membership to the LDF was noted by Welch (1939) when he proved that the
assignment procedure based on the LDF minimizes the probability of
misclassification under certain restrictions. Von Mises (1945) extended
Welch's notions to the K-group case, and removed the restriction that
probabilities of misclassification per group be equal.

The classification statistics discussed in this paper will be stated
in terms of estimates of population parameters; in so stating, no claim is
made that an optimum solution is obtained. Further, only the situation
of equal costs of misclassification will be considered. Assuming
multivariate normality and identical population covariance matrices, the
distance measure [10] has been used as a classification statistic, as
well as a criterion in cluster analysis (Friedman and Rubin, 1967). An
individual, with score vector $X_i$, is assigned to that group, k, for which
the distance measure

[13]   $D^2_{ik} = (X_i - \bar{X}_k)' S^{-1} (X_i - \bar{X}_k)$

is least. These measures may be transformed to "centour scores" which
are functions of probabilistic distances (Cooley and Lohnes, 1971, p. 265).
For a given individual, the assignment is based on the largest centour.
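
The minimum-distance assignment may be sketched as follows, assuming the group centroids and a pooled covariance matrix have already been estimated. The function name and all numerical values are hypothetical.

```python
import numpy as np

def assign_min_distance(x, means, S):
    """Assign score vector x to the group with the smallest squared distance
    (x - mean_k)' S^{-1} (x - mean_k), using one pooled covariance matrix S."""
    S_inv = np.linalg.inv(S)
    d2 = np.array([(x - m) @ S_inv @ (x - m) for m in means])
    return int(np.argmin(d2)), d2

# Hypothetical centroids and pooled covariance for K = 3 groups, p = 2.
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
S = np.array([[1.0, 0.3], [0.3, 1.0]])
group, d2 = assign_min_distance(np.array([2.5, 0.4]), means, S)
```

Here `group` is a zero-based index, so the value 1 denotes the second centroid. Because prior probabilities play no part, two groups of very different sizes are treated alike.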

An inadequacy of [13] is that differential prior probabilities, $p_k$,
of group membership are ignored. Using the multivariate normal distribution
function and retaining the equal covariance condition, a modification of
[13] becomes

[14]   $L_{ik} = -\tfrac{1}{2} \ln|S| - \tfrac{1}{2} D^2_{ik} + \ln p_k$ .

The more popular form of a "linear discriminant score" (Rao, 1965, p. 488),

[14a]   $L_{ik} = \bar{X}_k' S^{-1} X_i - \tfrac{1}{2} \bar{X}_k' S^{-1} \bar{X}_k + \ln p_k$ ,

is equivalent to [14], since the terms $-\tfrac{1}{2} \ln|S|$ and
$-\tfrac{1}{2} X_i' S^{-1} X_i$ are common to all k. Thus, individual i is assigned to
that population whose corresponding sample yields the largest value of the
classification statistic [14]. Such a rule minimizes the number of
misclassifications, in a parameter sense, and is equivalent to a rule which
assigns the individual with measures $X_i$ to that population for which the
posterior probability of population membership is largest. Some writers
[e.g., Eisenbeis and Avery (1972, p. 18)] prefer to express the
classification statistic as a posterior probability:

[15]   $P_{ik} = \dfrac{p_k \exp(-\tfrac{1}{2} D^2_{ik})}{\sum_{k'=1}^{K} p_{k'} \exp(-\tfrac{1}{2} D^2_{ik'})}$ .

[Statistics [14] and [15], which are equivalent in the sense of
classification results, are, in turn, equivalent to those reported in
Rule IV by Cooley and Lohnes (1971, p. 269) and in Rule 5.6 by Eisenbeis and
Avery (1972, p. 19).] Expressions [14] and [15] lead to what is sometimes
referred to as the "linear classification rule"; [14] is linear in that
$L_{ik}$ is linear in $X_i$. Equation [15] exemplifies the Bayesian
conditional-probability model.
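
The equivalence of the linear score and the posterior-probability form may be illustrated numerically. In the sketch below the posterior probabilities are taken to be proportional to the prior times the exponential of minus one-half the squared distance, the usual normal-theory expression under a common covariance matrix; the function name, centroids, covariance matrix, and priors are all hypothetical.

```python
import numpy as np

def linear_scores_and_posteriors(x, means, S, priors):
    """Linear discriminant scores and posterior probabilities for one
    score vector x, assuming a common covariance matrix S."""
    S_inv = np.linalg.inv(S)
    scores = np.array([m @ S_inv @ x - 0.5 * m @ S_inv @ m + np.log(pk)
                       for m, pk in zip(means, priors)])
    # Normalized exponentials of the scores. Terms common to all groups
    # cancel, so this matches p_k exp(-D2_k / 2) after normalizing.
    w = np.exp(scores - scores.max())
    return scores, w / w.sum()

# Hypothetical centroids, pooled covariance, and priors (K = 3, p = 2).
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
S = np.array([[1.0, 0.3], [0.3, 1.0]])
priors = [0.5, 0.3, 0.2]
x = np.array([1.4, 0.2])
scores, post = linear_scores_and_posteriors(x, means, S, priors)

# Direct evaluation from squared distances, for comparison.
S_inv = np.linalg.inv(S)
d2 = np.array([(x - m) @ S_inv @ (x - m) for m in means])
direct = np.array(priors) * np.exp(-0.5 * d2)
direct = direct / direct.sum()
```

The two routes give the same probability vector, and the group with the largest linear score is also the group with the largest posterior probability.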

Another linear rule based on posterior probabilities of group
membership under the present conditions has been proposed. The formula used
to compute the posterior probabilities is based on "Case E. $\Sigma_k = \Sigma$ but
unknown, $\mu_k$ unknown," presented by Geisser (1966, p. 155). [See also Cooley
and Lohnes (1971, p. 269).] Geisser's work resulted in the classification
statistic,

[16]   $Q_{ik} = \dfrac{p_k h_{ik}}{\sum_{k'=1}^{K} p_{k'} h_{ik'}}$ ,

where h^j^ Is the "predictive density of a future observation (vector) given 
the available data." and is proportional to 



N, 



lP/2 



N, + 1 
k 



1 + 



\ ^ik 



1 -(N-K + l)/2 



(Nj^ + 1) (N - K) 
It has been shown (Huberty, 1971b) that when the $N_k$-values are identical,
statistics [14] and [16] yield the same results. Of course, since for a
given individual the denominators of [15] and [16] are constant, only the
maximum values of the numerators need be considered in making assignments.
However, actually obtaining the probabilities may provide information in
addition to that of the mere number of correct and incorrect classifications.
For example, a vector of $P_{ik}$- or $Q_{ik}$-values of (.80, .15, .05) versus a
vector of (.48, .46, .06) would lead to the same decision, namely, assign
to group 1. However, it may be informative to examine such vectors to
determine those individuals, and their characteristics (as reflected in
$X_i$-vectors), who are misclassified. Also, by examining the probability
vectors, one can determine the group that an individual is most like
(highest value) and the group he is most unlike (lowest value).
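
The predictive-density form of [16] can be sketched in Python as follows. The sketch assumes that the density is proportional to (N_k/(N_k+1))^(p/2) times [1 + N_k D^2 / ((N_k+1)(N-K))] raised to the power -(N-K+1)/2, with D^2 the squared distance of [13]; the group sizes, distances, and function names are hypothetical.

```python
def predictive_weight(d2, n_k, n_total, n_groups, p):
    """Predictive density of [16], up to a constant shared by all groups."""
    shrink = (n_k / (n_k + 1)) ** (p / 2)
    core = 1 + n_k * d2 / ((n_k + 1) * (n_total - n_groups))
    return shrink * core ** (-(n_total - n_groups + 1) / 2)


def geisser_posteriors(d2_list, sizes, priors, p):
    """[16]: prior times predictive weight, normalized over the groups."""
    n_total, n_groups = sum(sizes), len(sizes)
    weights = [
        prior * predictive_weight(d2, n_k, n_total, n_groups, p)
        for d2, n_k, prior in zip(d2_list, sizes, priors)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical case: three groups of 30, two predictors, equal priors.
d2_list = [2.0, 1.5, 6.0]
q = geisser_posteriors(d2_list, sizes=[30, 30, 30], priors=[1 / 3] * 3, p=2)
best = q.index(max(q))
```

With equal group sizes and equal priors the weight is a decreasing function of the squared distance alone, so the group with the largest posterior probability is the group with the smallest distance, in agreement with the rule based on [14].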

Under the condition of unequal covariance matrices, variations of the
above three classification statistics are called for. If equal covariance
structure cannot be assumed, then S in [13] is replaced by the sample
covariance matrix for each group:

[17]    $D'_{ik} = [(X_i - \bar{X}_k)' S_k^{-1} (X_i - \bar{X}_k)]^{1/2}$.

Taking into account different group covariance matrices, the counterpart
of [14] may be expressed as a "quadratic discriminant score,"

[18]    $L'_{ik} = -\tfrac{1}{2} \ln|S_k| - \tfrac{1}{2} (D'_{ik})^2 + \ln p_k$.

Again, this classification statistic may be transformed to a statistic 
that yields posterior probabilities of group membership: 
[19]    $P'_{ik} = \dfrac{p_k |S_k|^{-1/2} \exp[-\tfrac{1}{2} (D'_{ik})^2]}{\sum_{k'=1}^{K} p_{k'} |S_{k'}|^{-1/2} \exp[-\tfrac{1}{2} (D'_{ik'})^2]}$.

[See Tatsuoka (1971, p. 228) for a discussion of the transformation of [18]
to a posterior probability.] Formulas [18] and [19] lead to what is sometimes
called the "quadratic classification rule," since [18] is quadratic in $X_i$.
A second quadratic rule has been proposed that is also built on posterior
probabilities of group membership. (Statistics [18] and [19] yield identical
results, as do [14] and [15].) The probabilities are based on a Bayesian
density specified in Geisser's (1966, p. 154) "Case C. $\Sigma_k$ unknown,
$\mu_k$ unknown." (See also Press, 1972, p. 375.) The posterior probabilities
are given by



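[20]    $Q'_{ik} = \dfrac{p_k g_{ik}}{\sum_{k'=1}^{K} p_{k'} g_{ik'}}$,

where $g_{ik}$ is a density proportional to

        $\left(\dfrac{N_k}{N_k + 1}\right)^{p/2} \dfrac{\Gamma(\tfrac{1}{2} N_k)}{\Gamma[\tfrac{1}{2}(N_k - p)] \, |(N_k - 1) S_k|^{1/2}} \left[1 + \dfrac{N_k (D'_{ik})^2}{N_k^2 - 1}\right]^{-N_k/2}$.

It can be shown that [20] and [19] (and hence [18]) yield identical results
when the $N_k$-values are the same.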
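
The quadratic score [18] and its posterior form [19] can be sketched in Python as follows. To keep the sketch short it assumes exactly two predictors, so that each group covariance matrix is 2 x 2 and its determinant and inverse have closed forms; the data and function names are hypothetical.

```python
from math import exp, log


def det2(s):
    """Determinant of a 2 x 2 matrix."""
    return s[0][0] * s[1][1] - s[0][1] * s[1][0]


def quad_form2(d, s):
    """d' S^-1 d for a 2 x 2 matrix S, using the closed-form inverse."""
    num = s[1][1] * d[0] ** 2 - (s[0][1] + s[1][0]) * d[0] * d[1] + s[0][0] * d[1] ** 2
    return num / det2(s)


def quadratic_scores(x, centroids, covs, priors):
    """[18]: -0.5*ln|S_k| - 0.5*(squared distance of [17]) + ln(prior)."""
    scores = []
    for c, s, p in zip(centroids, covs, priors):
        d = [x[0] - c[0], x[1] - c[1]]
        scores.append(-0.5 * log(det2(s)) - 0.5 * quad_form2(d, s) + log(p))
    return scores


def quadratic_posteriors(scores):
    """[19]: exponentiate the scores of [18] and normalize over groups."""
    top = max(scores)
    weights = [exp(v - top) for v in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical case: a compact group and a diffuse group, equal priors.
x = [2.0, 0.0]
centroids = [[0.0, 0.0], [4.0, 0.0]]
covs = [[[1.0, 0.0], [0.0, 1.0]], [[9.0, 0.0], [0.0, 9.0]]]
scores = quadratic_scores(x, centroids, covs, [0.5, 0.5])
post = quadratic_posteriors(scores)
```

In this made-up case the individual is closer to the second centroid in the metric of [17], yet [18] assigns the individual to the first group, because the ln|S_k| term penalizes the more diffuse group. A rule based on [17] alone and a rule based on [18] can therefore disagree even with equal priors.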

Horst (1956a) considered a formulation of the classification problem
involving separate regression equations contrasting each criterion group
in turn with all others. In finding the regression equation corresponding
to group k, the dichotomous criterion variable assumes the value 1 for
individuals in group k and 0 otherwise. To estimate the coefficients used
in Horst's "least squares" multiple classification method, the total
covariance matrix of the predictors is involved. The following classification
statistic results when the $N_k$-values are identical:

[21]    $Y_{ik} = b_k' (X_i - \bar{X}) + \bar{Y}_k$,

where $b_k$ is the (p x 1) vector of sample coefficients for group k,

        $b_k = T^{-1} u_k$,

with T = H + E and $u_k$ the vector of deviation score cross-products of the
predictors and the (dichotomous) criterion, the deviations being taken from
the grand means, and $\bar{Y}_k = N_k/N$. Statistic [21] leads to a decision
rule which assigns an individual to that population for which his
corresponding composite score is nearest unity. A modification of [21] is
required with different $N_k$-values (Horst, 1956b).

For ease of reference the seven statistics presented are given in 
Table 1. 

— Insert Table 1 About Here — 

With this apparent variety of classification statistics available,
which does one use? Assuming the condition of multivariate normality is
tenable, the choice seems to depend upon whether or not the added condition
of equal covariance structure is also tenable, and whether or not
differential priors are to be involved. [An added criterion of choice may be
one's preference for use of statistics based on the classical approach or
on the Bayesian solution of Geisser (1964, 1966) and Dunsmore (1966). The
Bayesian solution is simpler to come by in that it is not based on any
complicated distribution theory.] In a Monte Carlo study where both
noncross-validation and cross-validation results were reported, Huberty
and Blommers (1974) concluded that the rule based on [21], or its
modification for unequal $N_k$-values, does not yield as great accuracy as that
yielded by the other rules considered. Knutsen (1955), however, concluded
from a single sample that this rule was more accurate than the rule based
on [14]. Huberty and Blommers (1974) also found that incorporating
prior probabilities into a rule enhances classification accuracy
(rules based on [13] or [17] versus those based on [14] or [18]). They
further concluded that rules based on [16] and [19] yielded nearly the
same results; no comparison of [16] and [15] was made, but since sampling
was made from populations with a common covariance matrix, it is conjectured
that these two statistics would yield similar accuracy of classification.
So, in the linear case (when covariance matrices are taken to be equal),
either [14] (or [15]) or [16] may be used as classification statistics with
expected results very similar.

Insofar as could be determined, no studies have been undertaken to
compare the efficiency of [18] (or [19]) to that of [20]. Cooley and
Lohnes (1971, pp. 270-272) report the results of some Monte Carlo
classification studies in which the efficiencies of [14], [16], and [19] are
compared. (Their "Anderson method" is equivalent to that based on [14].)
Their reported results do not suggest the superiority of any one of the
three statistics; they do conjecture, however, that the rule based on
[19] "might suffer more from capitalization on chance differences in
covariances [p. 272]." It is noted in passing that the equivalence of
[14] and [16] with equal $N_k$-values and $p_k = N_k/N = 1/K$ for all k (Huberty,
1971b) was empirically verified by Cooley and Lohnes (1971, p. 272) when
these statistics led to the same proportion of correct classifications.

A description and application of a simulation program designed to
obtain estimates of different types of misclassification probabilities
and to compare linear and quadratic classification rules is given by
Michaelis (1973). The different misclassification probabilities are those
discussed in a subsequent section, "Estimating Error Rates." It was assumed
that all prior probabilities and misclassification costs are equal. The two
multivariate normal classification rules used were, in essence, those based
on [14] (the linear rule) and on [18] (the quadratic rule), disregarding the
$\ln p_k$ terms. In the simulation process, the model parameters were chosen
to be equal to parameters which had been estimated from real data. The
basic model considered was one where K = 5 and p = 8 with unequal population
covariance matrices. Sample sizes of 30 and 100 per group were used. Both
"internal classification," where the parameter estimates are based on the
samples classified, and "external classification," where the parameter
estimates are based on a sample other than that classified, were used. As
might be expected, quadratic classification yielded the better results. The
difference between the results of internal and external classification was
found to be substantially larger for the quadratic than for the linear rule,
especially for the smaller sample size. This is presumably due to the fact
that the number of estimated parameters is much smaller in the linear rule.
For all simulated larger samples ($N_k = 100$) the external quadratic
classification gave better results than the corresponding linear classification,
although the estimation of the parameters was not yet very good, as could
be seen from the differences between internal and external results,
especially for quadratic classification. Even with the smaller sample
sizes, where the differences between internal and external analysis are
very large, in most samples external quadratic classification gave better
results than the corresponding linear classification. In conclusion,
Michaelis recommends both an internal and an external classification
in each practical application. The differences between the two resulting
proportions indicate an interval in which the "true error" can be expected
to lie. Further, if the two proportions differ greatly, one could expect
to achieve better classification of independent samples by increasing
the sample size. Results of several other simulation experiments are
graphically reported. This is an excellent reference for anyone interested
in simulation experiments in multivariate classification.

A few writers have advanced arguments in favor of using classification
statistics based on LDFs rather than on the original predictors (Cooley
and Lohnes, 1962, p. 139; Tatsuoka, 1971, p. 232; Eisenbeis and Avery,
1972, p. 56). Briefly, the arguments presented for using such "reduced
space" procedures are: (1) the linear transformation (when covariance
matrices are equal) preserves the overall structure as well as distances
in the reduced (or discriminant) space of dimension r = min(K-1, p), and
computations are easier; (2) since most often $r \leq 2$, computations are
further reduced, and interpretations are simpler; (3) the Central Limit
Theorem implies that the distribution of the linear discriminant scores for
each group approaches normality; and (4) classifications may be more consistent
over repeated sampling because of relatively greater stability of statistics
based on LDFs.

In their empirical study, Huberty and Blommers (1974) found that the
decision rule based on [19] with discriminant scores as input did better
over repeated sampling than with original predictor scores. From the results
of another empirical study, where the conditions of normality and equal
covariance matrices were controlled, Lachenbruch (1973) concluded that a
reduced space classification method works about as well as the method based on
[14] if the population means are collinear or nearly so. Otherwise, [14]
proved much better. In that study, the sample size and the $p_k$-values
were taken to be equal across the groups. Lohnes (1961) used [17] in
classifying three sets of real data in both the original spaces of the
predictors and in the discriminant space. The equal covariance structure
condition was not met for at least two sets; results were not presented for
the third data set. For all three sets it was concluded that the two methods
produce comparable classification results.

Four different classification rules were investigated in an empirical 
study — using data on engineering students — by Molnar and Delauretis 
(1973). The statistics used may be expressed as (1) [17] in the discriminant
space, (2) [19] in the discriminant space, (3) [14] with equal $p_k$-values
[the equivalent of the statistic used in the BMD 5M program (Dixon, 1973)],
and (4) [15], which is equivalent to that used in the BMD 7M program. The
second statistic yielded slightly better results than the first for 
one set of data. The first, third, and fourth statistics did equally well 
for a different data set. The purpose of such comparisons is not clear; 
conclusions about the relative efficiencies of the rules cannot be made 
from such a study. For the first set of data involving three groups, three 
two-group classification analyses were also carried out. 

In addition to discussing the use of LDFs in classification, Overall
and Klett (1972, Ch. 14) indicate that another orthogonal transformation
may be useful for classification purposes. The transformation is obtained
via a principal components analysis of the within-groups covariance matrix,
S. The use of two LDFs and that of four principal components were compared
for a set of data involving three criterion groups, 16 predictors, and nearly
3000 individuals. Results of maximum likelihood classification (as
from [13], except that priors are not considered) applied in the two
reduced spaces (of two and four dimensions) were very similar.

Much more empirical work needs to be done in the multi-group case in
assessing the efficiency of the various classification statistics under
different conditions. Comparisons within the set of linear rules, within
the set of quadratic rules, and across the two sets remain problems for
future study, as do those involving the use of different priors and
non-normal distributions. Eisenbeis and Avery (1972, p. 53) conjecture that
the use of linear versus quadratic techniques will affect the classification
less than variation in prior probabilities. In a two-group study Anderson
and Bahadur (1962) pointed out that deviations from normality may affect
the results of quadratic classification much more than those of linear
classification. The study of reduced space classification using different
orthogonal transformations of the raw data in the dimension reduction
may also be of interest.

Efficiency of Classification 

The results of any classification analysis may be summarized in a
K x K classification table (or "confusion matrix"). The two dimensions of
the square matrix are actual group membership and predicted group
membership. One set of diagonal elements of such a cross-tabulation matrix gives
the number of "hits" or correct classifications for each group. Data from
this matrix may be used to test whether the classification procedure used
is significantly better than a purely random partitioning of the decision
space; i.e., better than if assignments of individuals to groups were
based on chance alone. Since the only entries in the confusion matrix
of interest for this test are those on one of the diagonals, the usual
Pearson chi-square test is not appropriate. Significance by the Pearson test
is a necessary but not sufficient condition for concluding that the number
of correct classifications is greater than would be expected by chance.

Three statistics have been proposed for testing the efficiency of
a classification procedure; the referent distribution that may be used for
all three is the standard normal. They are reported by Lubin (1950), McHugh
and Apostolakos (1959), and Press (1972, p. 382). (The second reference
has a minor error in a formula used.) None of these tests is strictly
appropriate since the same data are being used to test the procedure as to
define the procedure. If sufficient data are available, it might seem more
appropriate to use a holdout sample to assess efficiency; however, as will
be noted in the next two sections, better methods are available. The
hit rate yielded by these better methods may then be compared to the expected
hit rate based on chance alone, $\sum_k p_k (N_k/N)$.
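
The comparison of an observed hit rate with the chance hit rate can be sketched in Python as follows. The classification table is hypothetical, rows are taken as actual groups and columns as predicted groups, and the priors are assumed here to be proportional to the group sizes; for the comparison to be meaningful, the table should come from one of the better (external) methods rather than from internal classification.

```python
def hit_rates(table):
    """Overall and per-group hit rates from a K x K table (rows actual, columns predicted)."""
    n = sum(sum(row) for row in table)
    hits = sum(table[k][k] for k in range(len(table)))
    per_group = [table[k][k] / sum(table[k]) for k in range(len(table))]
    return hits / n, per_group


def chance_hit_rate(table, priors):
    """Expected hit rate under chance assignment: sum over groups of prior * N_k / N."""
    n = sum(sum(row) for row in table)
    return sum(p * sum(row) / n for p, row in zip(priors, table))


# Hypothetical table for three groups of sizes 50, 40, and 30.
table = [[40, 8, 2], [10, 25, 5], [3, 7, 20]]
sizes = [sum(row) for row in table]
priors = [n_k / sum(sizes) for n_k in sizes]  # assumed proportional to group size
overall, per_group = hit_rates(table)
chance = chance_hit_rate(table, priors)
```

Here 85 of 120 individuals are hits, an observed rate of about .71, against a chance rate of about .35.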

The efficiencies of two different classification procedures applied to
the same data may be compared via McNemar's test of related proportions.
This test, and an extension of it proposed by W. G. Cochran for use in
comparing more than two procedures, are discussed by Hays (1973, pp. 741,
773).
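
McNemar's test uses only those individuals on whom the two procedures disagree. The following Python sketch assumes the large-sample normal form of the statistic without a continuity correction; the counts and the function name are hypothetical.

```python
from math import erfc, sqrt


def mcnemar(correct_a, correct_b):
    """Return (z, two-sided p) comparing two procedures on the same individuals.

    correct_a and correct_b are equal-length sequences of booleans that mark
    whether each individual was correctly classified by procedure A and B.
    """
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    if only_a + only_b == 0:
        return 0.0, 1.0  # the procedures never disagree
    z = (only_a - only_b) / sqrt(only_a + only_b)
    p_value = erfc(abs(z) / sqrt(2))  # two-sided standard normal tail area
    return z, p_value


# Hypothetical results for 100 individuals: 60 correct under both procedures,
# 15 correct under A only, 5 correct under B only, 20 correct under neither.
a = [True] * 60 + [True] * 15 + [False] * 5 + [False] * 20
b = [True] * 60 + [False] * 15 + [True] * 5 + [False] * 20
z, p_value = mcnemar(a, b)
```

The 80 individuals on whom the procedures agree do not enter the statistic; with 15 versus 5 discordant cases, z is about 2.24.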

Estimating Error Rates 

Most of the work done with methods of estimating proportions of
classification errors deals with the two-group situation. Much of this
research will be reviewed in the next section.

Three types of errors may be associated with a classification rule:
(1) true error, (2) actual error, and (3) apparent error (see Hills, 1966).
True (or optimal) error is the long-run frequency of misclassifications
using a classification rule which assumes population parameters are known.
Actual error is the long-run frequency of misclassification using a rule
which uses estimates of the unknown parameters. Apparent error is the
proportion of the "norming sample" misclassified by a rule which uses
parameter estimates (internal classification). As will be shown in the next
section, estimators of true error and of actual error are simply formulated
in the two-group case; however, the formulations for estimators in the
multi-group case are complicated indeed (Glick, 1972). For large samples, true
error and actual error will be approximately equal, and an estimator of
one could be used for the other. Apparent error has often been used as an
estimate of these two types of error. As might be intuited, since classifying
the norming sample with a rule determined by this same sample is quite
likely to capitalize on chance, apparent error may grossly underestimate
actual or true error.

A better estimate may be obtained by extending a technique (Lachenbruch,
1967) which was proposed for the two-group case. This ("jackknife")
technique requires the application of a classification rule N ($= \sum_k N_k$)
times, withholding a different vector of measures each time. The individual
whose vector was withheld is then reclassified using the statistics based on
the other N-1 sets of measures. The proportions of misclassified individuals
from each group are used as estimates of the conditional probabilities of
misclassification. One minus the proportion of misclassification across all
K groups may be used as a measure of the discriminatory power of the predictors.
Such a measure informs a researcher how well a set of predictors
differentiates the criterion populations, and it may serve as a yardstick in
determining whether the addition of new variables or the deletion of old
ones is warranted (Geisser, 1970, p. 60).
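
The leave-one-out procedure just described can be sketched in Python as follows. To keep the sketch self-contained it assumes a single predictor and a minimum-distance rule with a pooled within-group variance (the one-variable case of [13]); the data and function names are hypothetical.

```python
def fit(xs, ys):
    """Group means and pooled within-group variance for one predictor."""
    groups = sorted(set(ys))
    means = {}
    ss = 0.0
    for g in groups:
        vals = [x for x, y in zip(xs, ys) if y == g]
        means[g] = sum(vals) / len(vals)
        ss += sum((v - means[g]) ** 2 for v in vals)
    pooled_var = ss / (len(xs) - len(groups))
    return means, pooled_var


def classify(x, means, pooled_var):
    """Assign x to the group with the smallest squared distance (x - mean)^2 / s^2."""
    return min(means, key=lambda g: (x - means[g]) ** 2 / pooled_var)


def apparent_error(xs, ys):
    """Proportion of the norming sample misclassified by its own rule."""
    means, var = fit(xs, ys)
    return sum(classify(x, means, var) != y for x, y in zip(xs, ys)) / len(xs)


def leave_one_out_error(xs, ys):
    """Refit the rule N times, each time classifying the one withheld individual."""
    misses = 0
    for i in range(len(xs)):
        means, var = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        misses += classify(xs[i], means, var) != ys[i]
    return misses / len(xs)


# Hypothetical scores for two groups of four individuals each.
xs = [1.0, 2.0, 3.0, 5.4, 5.8, 8.0, 9.0, 10.0]
ys = [1, 1, 1, 1, 2, 2, 2, 2]
internal = apparent_error(xs, ys)
external = leave_one_out_error(xs, ys)
```

In this made-up sample the apparent error is zero, whereas the leave-one-out estimate is .25: the two individuals nearest the boundary are misclassified once they no longer contribute to their own group means.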

Classification in Two-Group Case

Rather than considering two linear discriminant scores (i.e., values
of $L^*_{ik}$ in [14a]), only one comparison is involved when there are only
two criterion groups. The classification decision can be made by computing

[22]    $L^*_{i1} - L^*_{i2} = (\bar{X}_1 - \bar{X}_2)' S^{-1} X_i - c$,

where

        $c = \tfrac{1}{2} (\bar{X}_1' S^{-1} \bar{X}_1 - \bar{X}_2' S^{-1} \bar{X}_2) + \ln(p_2/p_1)$.

It is noted that the vector of coefficients of the single LDF (see [3]) may
be taken as

        $v = S^{-1} (\bar{X}_1 - \bar{X}_2)$.

The decision rule is to assign individual i to the first population if
$v' X_i \geq c$, and to the second population if $v' X_i < c$. The formal
equivalence of two-group discriminant analysis and multiple regression analysis,
where the criterion variable is measured by group membership, was mentioned
previously in this paper. When the number of individuals in each of the two
groups is the same, classification based on [22] is identical to that based
on [21] for K = 2.
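
The two-group rule can be sketched in Python as follows. The sketch takes the LDF weights as the inverse pooled covariance matrix times the difference of the centroids, and the cutoff as one-half the difference of the two quadratic forms in the centroids plus ln(p_2/p_1). It assumes two predictors so that the 2 x 2 inverse has a closed form; the data and function names are hypothetical.

```python
from math import log


def inverse2(s):
    """Closed-form inverse of a 2 x 2 matrix."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    return [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]


def matvec(m, v):
    """Product of a 2 x 2 matrix and a 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def dot(a, b):
    """Inner product of two 2-vectors."""
    return a[0] * b[0] + a[1] * b[1]


def two_group_rule(mean_1, mean_2, s, p_1, p_2):
    """Return the LDF weights v and the cutoff c of the two-group rule."""
    s_inv = inverse2(s)
    v = matvec(s_inv, [mean_1[0] - mean_2[0], mean_1[1] - mean_2[1]])
    c = 0.5 * (dot(mean_1, matvec(s_inv, mean_1))
               - dot(mean_2, matvec(s_inv, mean_2))) + log(p_2 / p_1)
    return v, c


def assign(x, v, c):
    """Group 1 if v'x >= c, otherwise group 2."""
    return 1 if dot(v, x) >= c else 2


# Hypothetical centroids and pooled covariance matrix.
mean_1, mean_2 = [3.0, 1.0], [0.0, 0.0]
S = [[4.0, 1.0], [1.0, 2.0]]
x = [2.5, 1.5]
v, c = two_group_rule(mean_1, mean_2, S, 0.5, 0.5)
v_rare, c_rare = two_group_rule(mean_1, mean_2, S, 0.1, 0.9)
```

With equal priors the cutoff is 8/7 and the individual, whose LDF score is 2, is assigned to the first group. When the first group is given a prior of only .1, the cutoff rises by ln 9 and the same individual is assigned to the second group; the LDF weights themselves do not depend on the priors.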

It is in the two-group case where most work has been done in assessing 
the robustness of linear classification rules to various departures from 
assumptions. Investigations of linear rules for unequal covariance matrices 
have been performed by Kossack (1945), Smith (1947), and Gilbert (1969).
Studies involving the classification of non-normal data are reviewed in 
a later section. See also, Lachenbruch (1966). 

Prior probabilities or "base rates" were used in a univariate
classification scheme about twenty years ago by Meehl and Rosen (1955)
in a two-group study. This use of unequal priors to increase classification
accuracy was critiqued by Cureton (1957). Overall and Klett (1972, pp. 
pn|p67) discuss unequal base rates used jointly with LDF scores in 
graphically determining appropriate cut-off points for classification into
one of two groups. Alf and Dorfman (1967) determine a cut-off score for a
single predictor or a weighted sum of predictors such that the expected 
value of the decision procedure is maximized, taking gains and losses asso- 
ciated with correct and incorrect assignments into account. See also, 
Gregson (1964). 

Considerable theoretical and empirical research has been reported that
deals with estimating probabilities of misclassification in the two-group
case. Hills (1966) has given an excellent account of the problems involved
in estimating various error rates in multivariate two-group classification
problems. Hockersmith (1969) reviews numerous methods of estimating true
and actual error (refer to the preceding section) and reports the
results of a Monte Carlo study comparing the accuracy of the methods. The
comparative accuracy of the methods depends upon the number of predictors,
group size, and distance between the two population centroids. It was
generally concluded that a method using a holdout sample, where a subset
of the original set of observations is classified using the rule determined
by the remaining observations (the norming sample), was inferior to the
others. As might be expected, it was concluded that apparent error was a
poor method in nearly all situations studied. If one method were to be
selected when the normality condition is questionable, it would be the
"U method" suggested by Lachenbruch (1967), which was described in the
preceding section of this paper. If the normality condition is met, a method
which combines the features of the U method and the use of the normal
distribution is recommended; here an estimate is taken as

[23]    $\tfrac{1}{2} [\Phi(-\bar{L}_1/s_1) + \Phi(\bar{L}_2/s_2)]$,

where $\Phi$ is the standard normal distribution function, $\bar{L}_k$ is the
mean of the $N_k$ values of [22] in group k, each based on $N_1 + N_2 - 1$
observations, and $s_k$ is the standard deviation of such values for group k.
Other specific conclusions were reached which, along with the general
conclusions, were comparable to those of Lachenbruch and Mickey (1968) in a
somewhat similar study. See also the three papers by Sorum (1971, 1972a, 1972b).
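
The estimate in [23] can be sketched in Python as follows. The sketch reads [23] as one-half the sum of two standard normal probabilities, one for each group, evaluated at the ratio of the mean to the standard deviation of the leave-one-out values of [22]. It assumes that those values have already been obtained for each group and that the standard deviation uses the n - 1 divisor, which the text does not specify; the values and function names are hypothetical.

```python
from math import erf, sqrt
from statistics import mean, stdev


def phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def normal_u_estimate(values_1, values_2):
    """Error estimate 0.5 * [Phi(-mean_1 / s_1) + Phi(mean_2 / s_2)]."""
    part_1 = phi(-mean(values_1) / stdev(values_1))  # group 1 members falling below 0
    part_2 = phi(mean(values_2) / stdev(values_2))   # group 2 members falling above 0
    return 0.5 * (part_1 + part_2)


# Hypothetical leave-one-out values of [22]: positive values favor group 1.
values_1 = [2.0, 1.0, 3.0, 0.5, 1.5]
values_2 = [-1.0, -2.5, -0.5, -3.0, -2.0]
estimate = normal_u_estimate(values_1, values_2)
```

No individual in this made-up sample is misclassified, so a simple count would give an error estimate of zero; the normal-based estimate is between .04 and .05 because it uses the spread of the values rather than their signs alone.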

Results of studies by Lachenbruch (1968) and Hockersmith (1969)
have also led to conclusions regarding sample size. The recommendations made
are dependent upon the number of predictors, the distance between the two
populations, and the tolerance between the estimated and optimum error rate.
Tables are provided by both writers which indicate a desired common sample
size in different situations. Using [23] as an error rate estimate, sample
size requirements are summarized by Lachenbruch (1968) as follows: (1)
for large tolerance only small samples are needed; small tolerances imply
the need for large samples; (2) groups widely separated need smaller samples
for classification than groups that are close together; and (3) as the number
of parameters increases, the required ratio of sample size to number of
parameters decreases. Hockersmith (1969) draws similar conclusions, and
specifically states that for the better error rate estimates, "...a sample
size of 40 in each group could be used to insure with some confidence that the
estimate of (true) error will be within a tolerance of .05 [p. 80]."
See also Dunn (1971).

Specific Uses of Classification

Classification has at times proved helpful when used in conjunction
with other multivariate data analysis techniques. A use of classification
procedures in a prediction study is given by Lissitz and Schoenfeldt (1974);
probabilities determined by [19], with equal priors, were used as weights
in a multivariate prediction model. Rogers and Linden (1973) transformed
values obtained from [13] to probabilities of group membership that were
used as classification statistics so as to test the efficiencies of three
grouping (or clustering) methods on a given set of data. A classification
procedure was also used by Schoenfeldt (1970) to validate a clustering
method. It is noted that the purpose of this latter type of study is to
define groups rather than to predict group membership; thus the usual tests
of significance used in discriminant analysis do not apply (Friedman and
Rubin, 1967, p. 1167).

Jackson (1968) studied two methods of estimating unknown values in
discriminant analysis and used as his criterion of comparison the proportion
of correct classifications (one minus apparent error) yielded by each
method. The criterion of one minus apparent error was also used by
Huberty (1971a) in assessing the effectiveness of various methods of
selecting a subset of predictors of a given size.

The close relationship of multivariate classification techniques to 
"profile analysis" is pointed out by Overall and Klett (1972, Ch. 15). 

Classification Research Applications

As in the section on discrimination applications, only selected
journal articles in behavioral research dealing with applications of
classification procedures will be reviewed. Two sets of studies are reviewed:
(1) studies dealing almost exclusively with classification, and (2) studies
using both discrimination and classification techniques.

Only one study selected (Doerr and Ferguson, 1968) carried out
classification in the reduced space. Eight vocational test scores and five
interest inventory scores for 982 high school students were used for
assignment of individuals into one of eight vocational course groups. The null
hypothesis of MANOVA was rejected. Two "significant" LDFs were determined by
examining the ratios of the individual eigenvalues of $E^{-1}H$ to their sum,
i.e., to the trace of $E^{-1}H$. A random 10% (from each group) was used for
cross-validation purposes; both internal and external classification results
were reported. The classification statistic used was not indicated;
presumably it was [19] in the reduced space; no priors were specified.

In some studies conducted for the purpose of prediction where the
dependent measure is nominal, the classification statistic is not made
explicit. In a study by Stahmann (1969) it was merely stated that "Multiple
discriminant analysis was used as a classification procedure ... [p. 110]."
That study involved approximately 500 bachelor degree graduates in five fields
of study; ten academic test scores, nine occupational interest inventory
scores, and two self expressions of major field were used as predictor measures.
Neither equality of covariance matrices nor of centroids was considered. A
holdout sample was employed for validation purposes; the proportion of the
original sample was not specified. External classification results were
reported for one data set, while internal classification was used for two other
sets of data. Fourteen measures, seven of which were academic test scores,
on 160 college freshmen were used by Chastain (1969) to predict membership
in one of four classes, two audio-lingual and two "cognitive." The hypothesis
of equal mean vectors was rejected. Total and separate group correlation
matrices were given. Internal classification results via an unknown statistic
were tabled; the need for external classification was noted, however.
Pearson's chi-square statistic was used to assess the efficiency of
classification. Multiple regression techniques were used with the same
predictors, but to answer a different question.




Four "intellective" and 30 "nonintellective" variables were used by
Keenen and Holmes (1970) in predicting membership in one of three groups
(graduates, withdrawals, failures) of 364 college freshmen. The
classification statistic used is [17]. A 50 percent holdout sample was used to
validate the classification procedure; internal and external classification
results were reported. The correlational statistic, $1 - \Lambda$, was considered
since its use "...is presently felt to be more meaningful than F (a
transformation of $\Lambda$) in evaluating the results of a discriminant analysis [p. 93]."

See Alumbaugh, Davis, and Sweney (1969) and Cohen (1971) for examples 
of two-group classification studies. 

Five studies will now be mentioned that utilized both discrimination
and classification techniques; for these studies only the statistical
techniques used will be discussed. All but one of the studies reported
only internal classification results. The classification statistic was not
specified by Kirkendall and Ismail (1970), Southworth and Morningstar (1970),
or Asher and Shively (1969). In the first study Wilks' $\Lambda$ was used to
test for centroid differences among the three populations. Standardized
weights were used to assess variable contribution to the lone LDF which
was retained due to the percent of "total among-group variation" absorbed.
The generalized Mahalanobis distance statistic (labeled W in the present
review) was used in the second study to test the null hypothesis of
MANOVA. It was not specified which LDF weights were used to sort out the
two most effective discriminators. Asher and Shively substituted mean
values of variables within each of their four groups for missing data.
Three LDFs were statistically significant, but only two were considered
since the two associated eigenvalues accounted for nearly 97% of the trace
of $E^{-1}H$. The type of weights used for interpretive purposes was not made
explicit.



Standardized weights were used by Neal and King (1969) and Wood (1971).
Following the use of Wilks' lambda statistic, Neal and King carried out both
an internal and external classification using, presumably, statistic [19];
priors were not specified. A "chi-square Goodness of Fit test" was used to
determine if the observed distributions, via both internal and external
classification, could have been obtained by chance. The results of the
statistical classification were compared to those of a "configural analysis"
by means of a "chi-square contingency test." The BMD 5M program was used
by Wood for his group assignment procedure; that is, statistic [14] lacking
the $\ln p_k$ term. The coefficients of the classification equations, [14a],
were "scaled" to determine the "relative discriminatory power for each
variable."

Other Issues, Problems, Developments 
Regression Analysis and Classification 
The formal relationship between multiple regression analysis and
two-group discriminant analysis was noted previously in this paper. There
have been a few studies which have attempted to compare the classification
efficiencies of the two methods on a given set of data. In one study
(Alexakos, 1966), college grade-point average was used as the criterion
measure for both analyses; in two other studies (Dunn, 1959; Bledsoe, 1973)
the criterion measure is different for the two analyses. In the first
study the classification method used was not made clear; Dunn used the
statistic given in [13], while Bledsoe used [15]. From a statistical
viewpoint, the appropriateness of such comparisons appears questionable. Use
of the same criterion measure in both analyses would ignore requirements for
one or the other; also, the results of such a comparison could be different
depending upon the classification statistic used. Using different criterion
measures implies that the statistical predictions would not be comparable.
Some substantive knowledge may be gained from such comparisons, however.
Of course, the two analyses answer different and, perhaps, interesting
questions (Tiedeman, 1951). [See Rulon, et al. (1967, pp. 323-336).]

A hybrid of the regression and discriminant analyses which has consider- 
able intuitive appeal is a "joint probability model" which was originally 
proposed by Tatsuoka (1956). This model considers information concerning 
group membership in combination with that concerning success or productivity 
in a group. This is an extension of the classification problem in terms of 
applications to vocational and educational guidance. The approach is dis- 
cussed and illustrated by Tatsuoka (1971, pp. 237-242) and Rulon, et al. 
(1967, Ch. 10). As attractive as this approach may appear, it has not 
enjoyed widespread use; the Lissitz and Schoenfeldt (1974) study referred 
to earlier used the idea, though it was not very helpful. 



Non-Normal Data 

Although most research in discriminant analysis using non-normal data 
has dealt with classification, some work has been done in the area of 
discrimination. Variable selection was the concern of Elashoff, Elashoff, 
and Goldman (1967) with dichotomous discriminators in the two-group case, 
and of Hills (1967) with dichotomous and polychotomous variables in the 
case of two or more groups. The selection procedure used by Hills was 
based on one of the nearest neighbor allocation rules of Fix and Hodges 
(1951); this latter report gives some of the first work in nonparametric 
discriminant analysis. A theoretical paper by Raiffa (1961) deals with 
the problem of (sequentially) selecting from items which are scored 0-1, 
a subset which will discriminate two groups of individuals about as well 
as the original set. 

General non-parametric or distribution-free, as well as specific 
discrete and other non-normal, univariate and multivariate classification 
procedures have been very adequately reviewed by Das Gupta (1973). Various 
procedures have been developed to classify individuals or observations 
characterized by various types of variables. For example, Solomon (1961) 
and Cochran and Hopkins (1961) developed classification techniques for 
categorical variables; Bargmann (1962b) developed a technique to classify 
time dependent data; Kendall (1966) and Kossack (1967b) developed techniques 
for ordinal data; Fix and Hodges (1951) developed a non-parametric technique 
for variables with unspecified distributions. Applications of non-parametric 
techniques in behavioral research have been very limited; two applications 
of the analysis of Cochran and Hopkins have been reported by Toms and 
Brewer (1971) and Kruglick and Brewer (1974). See also Overall and Klett 
(1972, Ch. 16). 
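
A nearest neighbor allocation rule of the kind attributed above to Fix and 
Hodges (1951) can be sketched in a few lines. The version below is only an 
illustration: the Euclidean metric, the default k = 3, the majority vote, and 
the function name knn_classify are assumptions made here, not details taken 
from the original report. 

```python
from collections import Counter


def knn_classify(train_points, train_labels, x, k=3):
    """Assign x to the group most common among its k nearest training points.

    Distances are squared Euclidean distances (an assumption of this sketch).
    """
    # Pair each training point's distance to x with its group label.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(point, x)), label)
        for point, label in zip(train_points, train_labels)
    )
    # Majority vote among the k closest points.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-group data on two measures.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["A", "A", "A", "B", "B", "B"]
group = knn_classify(points, labels, (0.5, 0.5))
```

No distributional form is assumed for the measures; the rule depends only on 
the chosen distance and on k. With two groups, an odd k avoids tied votes. 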

A multivariate classification procedure which can handle different types 
of predictor variables has been proposed and illustrated by Henschke, Kossack, 
and Lissitz (1974). The technique used, which is an extension of that 
proposed by Kossack (1967a) for the two-group case, accommodates multiple 
groups and three different types of variables: interval, ordinal, and nominal. 
It involves the transformation of each variable type in an appropriate fashion 
so as to convert it to an essentially measurable variable with equal group 
covariance matrices. The transformation of the nominal variable is based 
on that used by Bryan (1961). Once the transformations are completed an 
LDF is formed, and a Bayes classification rule is used. For a single 
set of data they found their generalized classification procedure to be 
"clearly superior" to the statistic [16] used on the same data by Lohnes 
and Gribbons (1970). 

Other comparisons of the efficiencies of normal-based and nonparametric 
classification rules for the K-group case are needed. Empirical studies 
extending comparisons made in the two-group case by Fix and Hodges (1952), 
Gilbert (1968), Gessaman and Gessaman (1972), and Moore (1973) would be 
four possibilities. Another possibility is an extension of the Lachenbruch, 
Sneeringer, and Revo (1973) study. 

Incomplete Data 

A number of methods have been proposed for handling the problem of 
parameter estimation when data values are missing or unknown in a multivariate 
analysis. Afifi and Elashoff (1966) provide an extensive review of the 
literature dealing with this problem. The estimation of covariance and 
correlation matrices for a single population was studied by Timm (1970). 
Most studies of this kind assume the data are missing at random (see Rubin, 
1973). 

As in other areas of discriminant analysis, most research dealing with 
incomplete data has been done for the two-group case. Jackson (1968) 
presents results of an empirical study where both the number of variables 
and number of individuals were very large. Her preliminary findings 
suggest that the far simpler method of using means for missing values gives 
results comparable with those of an iterative regression estimation technique. 
Probabilities of correct classification under eight methods of handling 
(randomly) missing values were studied by Chan and Dunn (1972) using Monte 
Carlo methods. The mean substitution method (again) and a principal 
component method were found, in general, to be superior to the 
other methods for cases considered. These writers caution that their results 
may not hold up with non-randomly missing data, with non-normal populations, 
and with unequal population covariance matrices. For a third study involving 
only two criterion groups, see Smith and Zeis (1973). See also McDonald 
(1971). 
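
The mean substitution method mentioned above can be illustrated with a short 
sketch. It is assumed here that the mean is taken within each criterion group 
(as in the Asher and Shively study described earlier); the studies cited do 
not all state which mean was used. The function name impute_group_means and 
the use of None to mark a missing value are conventions of this example only. 

```python
from collections import defaultdict


def impute_group_means(rows, groups):
    """Replace each None with the mean of the observed values of the same
    variable within the same group (within-group means are an assumption)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    # First pass: accumulate observed values per (group, variable).
    for row, g in zip(rows, groups):
        for j, value in enumerate(row):
            if value is not None:
                sums[(g, j)] += value
                counts[(g, j)] += 1
    # Second pass: fill in the missing entries.
    completed = []
    for row, g in zip(rows, groups):
        new_row = []
        for j, value in enumerate(row):
            if value is None:
                if counts[(g, j)] == 0:
                    raise ValueError("no observed values for this group and variable")
                value = sums[(g, j)] / counts[(g, j)]
            new_row.append(value)
        completed.append(new_row)
    return completed


# Hypothetical data: two groups, two variables, two missing values.
data = [[1.0, None], [3.0, 4.0], [None, 10.0], [5.0, 20.0]]
membership = ["a", "a", "b", "b"]
filled = impute_group_means(data, membership)
```

Substituting means leaves the group centroids unchanged but shrinks the 
within-group variances, which is one reason the cautions of Chan and Dunn 
about non-random missingness deserve attention. 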

Use of T Versus E 

In an early section of this paper a formulation for obtaining coefficients 
of LDFs was expressed in terms of the within-groups SSCP matrix, E (see 
equation [21]). Computationally, it would be equivalent to use 

    |T^{-1}H - θI| = 0 , 

which would lead to vectors of coefficients which are proportional to those 
obtained using E (Rozeboom, 1966, p. 562; Porebski, 1966a). In their 
formulations a few writers prefer the use of T, while most writers use E. 
The use of the T matrix was suggested by Ottman, Ferguson, and Kaufman (1956) 
in obtaining classification equations as an alternative to those given by 
L_k - ln p_k (see [14a]). The classification statistic proposed by these 
writers is based on T^{-1}. They claim that one of the principal and 
unique advantages of such a formulation is that "...once the data for the 
general population are available, the general population can be further 
subdivided and more equations developed for an indefinite number of 
subpopulations [p. 80]." The modified statistic is easily amenable to 
dropping, adding, or adjusting any of the criterion groups. 
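
The proportionality of the T-based and E-based coefficient vectors stated at 
the start of this section can be verified in a few steps. The sketch assumes 
that T = E + H, with H the between-groups SSCP matrix; the symbols λ (a root 
of the E-based equation), θ (the corresponding root of the T-based equation), 
and v (a coefficient vector) are notation chosen for this illustration. 

```latex
\begin{align*}
(H - \lambda E)\,v &= 0 \\
Hv &= \lambda E v \\
(1 + \lambda)\,Hv &= \lambda (E + H)\,v = \lambda T v \\
(H - \theta T)\,v &= 0, \qquad \theta = \frac{\lambda}{1 + \lambda}.
\end{align*}
```

Each vector v that satisfies the E-based equation therefore also satisfies 
the T-based equation; only the root changes, so the two sets of coefficients 
can differ at most by a normalizing constant. 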

In discrimination the important consideration in deciding which SSCP 
or covariance matrix(ices) to use pertains not to computations but to 
inferences the researcher wishes to make. Of course, inferential statements 
are related to the sampling design of the investigation. If the different 
groups being studied do not represent natural subgroups of some larger 
population (e.g., an experimental study involving different treatments), 
then it might seem appropriate to use T. However, when attempting to 
substantively interpret the LDFs in such a situation the use of T is 
irrelevant since the LDFs have no population counterpart (Mulaik, 1972, 
p. 428). 

Reporting Discriminant Analysis Results 
No matter which purpose or combination of purposes an analysis is to 
serve, it is recommended that the following be reported: (1) method of 
sampling, (2) data collection procedures including clear descriptions of 
measures used, (3) number of individuals in each criterion group, (4) means 
and variances (or standard deviations) on each variable for each group, and 
over all groups combined, and (5) the p x p correlation matrix based on E. 
In addition, the computer program(s) used — e.g., from a package, or 
self-written — should be specified. 
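
Item (5) above, the correlation matrix based on E, can be obtained by pooling 
deviations from each group's own means. The following is a minimal sketch; 
the function names and the toy data are hypothetical, and E is taken to be 
the within-groups SSCP matrix as defined earlier in this paper. 

```python
from math import sqrt


def within_groups_sscp(rows, groups):
    """Within-groups SSCP matrix E: cross-products of deviations of each
    observation from its own group's mean vector, summed over groups."""
    p = len(rows[0])
    by_group = {}
    for row, g in zip(rows, groups):
        by_group.setdefault(g, []).append(row)
    E = [[0.0] * p for _ in range(p)]
    for members in by_group.values():
        n = len(members)
        means = [sum(r[j] for r in members) / n for j in range(p)]
        for r in members:
            d = [r[j] - means[j] for j in range(p)]
            for i in range(p):
                for j in range(p):
                    E[i][j] += d[i] * d[j]
    return E


def correlation_from_sscp(E):
    """Scale an SSCP matrix to a correlation matrix."""
    p = len(E)
    return [[E[i][j] / sqrt(E[i][i] * E[j][j]) for j in range(p)]
            for i in range(p)]


# Hypothetical data: two groups of three individuals on two measures.
scores = [(1, 2), (2, 4), (3, 6), (10, 1), (11, 3), (12, 5)]
membership = ["a", "a", "a", "b", "b", "b"]
E = within_groups_sscp(scores, membership)
R = correlation_from_sscp(E)
```

In these toy data the two measures are perfectly correlated within each 
group, although group b is higher on the first measure and lower on the 
second; a correlation computed over all individuals, ignoring groups, would 
mix the within-group relationship with the centroid differences. 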

When separation is considered, univariate statistics (e.g., ANOVA 
F-values or the transformed correlational indices such as ω²) should be 
reported. Some assessment of the equal group covariance structure condition 
is recommended, such as group covariance matrix determinants or value of 
a test statistic. The statistic used in testing the null MANOVA hypothesis 
should be reported — the type as well as a numerical value. 

For discrimination the reporting of the above information plus more 
is recommended. First of all, the coefficients should be given, indicating 
whether they are applicable to raw scores or standardized scores. Also, 
if discriminator versus LDF correlations are used, they ought to be reported, 
indicating whether they are based on the total-group formulation [5] or the 
within-groups formulation [6]. If it is inferred that some discriminators 
could be deleted in subsequent similar studies, coefficients for the retained 
variables should be recomputed and reported. If the researcher favors the 
interpretation of functions beyond that associated with the largest root of 
E^{-1}H, then it is recommended that two-dimensional plots of centroids be 
presented. 

Further, assuming significant separation is determined, reporting an 
estimate of the proportion of variance of the p variables that is attributable 
to centroid differences is recommended. Estimates of pairwise group distances 
(between centroids) using [9] may also be informative. 

Certain information ought to be made explicit when reporting results 
of a classification study. Here too, values used in assessing group 
covariance structure should be reported. Reporting the classification rule or 
statistic used is also advised along with the priors used. It is further 
recommended that a table of hits and misses be given using both an internal 
and an external classification method. 
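
A table of hits and misses of the kind recommended above is simply a 
cross-tabulation of actual by assigned group. The sketch below is one way to 
build it; the function names and the example assignments are hypothetical, 
and the same function serves for internal and external classification results 
alike. 

```python
def hit_miss_table(actual, predicted, labels):
    """Cross-tabulate actual group (rows) by assigned group (columns)."""
    table = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        table[a][p] += 1
    return table


def hit_rate(table):
    """Proportion of individuals assigned to their own group."""
    total = sum(sum(row.values()) for row in table.values())
    hits = sum(table[g][g] for g in table)
    return hits / total


# Hypothetical assignments for five individuals in two groups.
actual = ["A", "A", "B", "B", "B"]
assigned = ["A", "B", "B", "B", "A"]
table = hit_miss_table(actual, assigned, ["A", "B"])
rate = hit_rate(table)
```

Reporting the full table rather than the overall hit rate alone shows which 
groups are being confused with which. 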

General References 
The sources cited here are restricted to those that can be used as 
references for discrimination and classification. All of them were referred 
to at least once earlier in this paper. 

To date, the best references, in the opinion of this writer, for 
discussions on discrimination are not found in books on multivariate methods. 
Four of these, in order of preference, are Tatsuoka (1973a), Porebski 
(1966b), Bargmann (1970), and Bock and Haggard (1968). In the Tatsuoka 
chapter, issues and problems in interpretation of LDFs are scattered 
throughout. A very readable discussion of the basic mathematics involved in 
discrimination and of an approach to interpretation is provided in a pamphlet 
by Tatsuoka (1970). An elaboration of this coverage is also provided by 
the same writer (Tatsuoka, 1971). Brief discussions are given in two 
chapters by Cooley and Lohnes (1971, Chs. 9 and 12) and at the very end of the 
fine book by Mulaik (1972). Tatsuoka's interpretations are based on 
standardized weights, while Cooley and Lohnes and Mulaik prefer the 
variable-LDF correlations. Eisenbeis and Avery (1972) give a good discussion 
of the problem of variable selection. 

Tatsuoka (1971) also provides a coverage of classification procedures, 
including a good discussion of posterior probabilities of group membership. 
An excellent general discussion of classifications based on posterior 
probabilities is given by Overall and Klett (1972), while Eisenbeis and 
Avery (1972) provide a discussion of the estimation of error rates. For 
the educational researcher, these latter two books suffer from the drawback 
of the lack of appropriate illustrations; Eisenbeis and Avery also have some 
annoying errors in their expressions for a few statistics. The books by 
Cooley and Lohnes (1971) and Press (1972) are the only ones of those 
reviewed that present Geisser's classification statistic based on posterior 
odds: [16] in the former and [20] in the latter. The book edited by 
Cacoullos (1973), basically one on multivariate classification, contains 
at least six very readable papers, all referred to earlier, plus a rather 
extensive bibliography at the end of the book. The review by Das Gupta is 
highly recommended. 

Computer Programs 

One or more of a number of statistical computer "packages" are readily 
accessible at most institutions; BMD, OSIRIS, SAS, and SPSS are popular 
packages. The three "discriminant analysis" programs in the BMD package 
(Dixon, 1973) have been reviewed quite extensively elsewhere (Huberty, 
1974a). The single program in the IBM Scientific Subroutine Package (SSP) 
is the same as the BMD 5M program, "Discriminant Analysis for Several Groups." 
The discriminant "functions" yielded by the BMD multi-group programs are 
the equations as in [14], not those found via [3]. 

There are some books that list a number of computer programs (e.g., 
Veldman, 1967; Cooley and Lohnes, 1971; Overall and Klett, 1972). As can 
be determined from the description given, the output of Veldman's discriminant 
analysis program (DSCRIM) includes the variable-LDF correlations as given in 
[6] and the group centroids. To classify individuals using Veldman's 
program it is necessary to use his cluster analysis program (HGROUP) which 
uses the statistic [13]. The discriminant analysis program of Cooley and 
Lohnes (DISCRIM) yields standardized weights, variable-LDF correlations as 
given in [5], the value of 1 - Λ, and "communalities" as given in [8]. Their 
classification program (CLASIF) utilizes the method of Geisser [16] for 
internal classification with prior probabilities defined by group sizes 
relative to N. The equality of the population covariance matrices is 
tested with their MANOVA program; however, no quadratic classification results 
are possible. The discriminant analysis program of Overall and Klett 
provides output similar to that of Cooley and Lohnes, plus pairwise distance 
measures (see [9]). The statistic used in their classification program is 
[15]; internal classification is also possible in a reduced space 
determined by orthogonal transformations of the original p measures. 

A discriminant analysis program is also given by Eisenbeis and Avery 
(1972). This program, which is reportedly available from The University of 
Wisconsin for a cost, provides considerably more output information for 
purposes of variable selection and of classification than those previously 
mentioned. No variable-LDF correlations are computed, however. The test 
of equal covariance matrices is carried out, followed by the use of either 
a linear [14] or a quadratic [18] classification statistic; reduced space 
classifications are also optional. The Lachenbruch (1967) jackknife method 
of estimating the probability of misclassification is used in this program. 
The combination method of variable selection described in an earlier 
section is utilized. 
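
The idea behind the Lachenbruch jackknife (leave-one-out) estimate can be 
sketched as follows: each individual is held out in turn, the rule is rebuilt 
from the remaining individuals, and the held-out individual is then 
classified. To keep the example free of matrix routines, a nearest-centroid 
rule with Euclidean distance is substituted for the linear or quadratic 
statistic; that substitution, the function names, and the data are all 
assumptions of this illustration and do not describe the Eisenbeis and Avery 
program. 

```python
def centroids(points, labels):
    """Mean vector of each group in the training data."""
    sums, counts = {}, {}
    for x, g in zip(points, labels):
        if g not in sums:
            sums[g] = [0.0] * len(x)
            counts[g] = 0
        sums[g] = [s + v for s, v in zip(sums[g], x)]
        counts[g] += 1
    return {g: [s / counts[g] for s in sums[g]] for g in sums}


def nearest_centroid(x, cents):
    """Assign x to the group whose centroid is closest (squared Euclidean)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(cents, key=lambda g: sqdist(cents[g]))


def leave_one_out_error(points, labels):
    """Proportion misclassified when each individual is held out in turn."""
    errors = 0
    for i, (x, g) in enumerate(zip(points, labels)):
        # Rebuild the rule without individual i, then classify individual i.
        train_x = points[:i] + points[i + 1:]
        train_g = labels[:i] + labels[i + 1:]
        if nearest_centroid(x, centroids(train_x, train_g)) != g:
            errors += 1
    return errors / len(points)


# Hypothetical data: the fourth member of group 0 lies among group 1.
points = [(0, 0), (0, 1), (1, 0), (6, 6), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 0, 1, 1, 1]
error_rate = leave_one_out_error(points, labels)
```

Because the held-out individual never contributes to the rule that 
classifies it, the estimate avoids the optimistic bias of an internal 
classification. 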

A new BMD program, discussed by Dixon and Jennrich (1973), is now 
available; it requires some special hardware, and may be obtained for a 
small cost. The program has three very promising added features: provision 
for (1) more meaningful graphic interpretation of results, (2) the handling 
of the unequal covariance structure situation, and (3) specifying relative 
costs of misclassification as well as differential prior probabilities 
for each group. 
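
How relative costs and differential priors can jointly affect an assignment 
is shown in the sketch below, which applies the usual minimum expected cost 
rule. It is not a description of the BMD program's computations; the function 
name, the cost layout (costs[k][j] is the cost of assigning a member of group 
k to group j), and the numerical values are assumptions of this example. 

```python
def min_expected_cost_group(densities, priors, costs):
    """Return the index of the group with the smallest expected cost.

    densities[k] is the density of the observation under group k,
    priors[k] is the prior probability of group k, and costs[k][j] is the
    cost of assigning a member of group k to group j (zero when k == j).
    """
    n = len(priors)
    expected = [
        sum(priors[k] * densities[k] * costs[k][j] for k in range(n))
        for j in range(n)
    ]
    return min(range(n), key=lambda j: expected[j])


# Hypothetical observation that is twice as likely under group 0.
densities = [0.2, 0.1]
priors = [0.5, 0.5]
equal_costs = [[0, 1], [1, 0]]
unequal_costs = [[0, 1], [5, 0]]  # missing a group-1 member is five times as costly

choice_equal = min_expected_cost_group(densities, priors, equal_costs)
choice_unequal = min_expected_cost_group(densities, priors, unequal_costs)
```

With equal costs the observation is assigned to the more probable group; 
when misclassifying a member of the second group is made five times as 
costly, the assignment switches even though the densities and priors are 
unchanged. 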

There are a few very general multivariate programs (e.g., those by 
Elliot Cramer and by Jeremy Finn) that are available to users. A program 
called MUDAID, which is used extensively at The University of Georgia, 
is an updated version of that by Applebaum and Bargmann (1967); it is now 
used mostly on the CDC 6400. The Cramer, Finn, and Bargmann programs are 
used basically for separation and discrimination, with varying outputs. 
Other individual specific computer programs useful in discriminant analysis 
are available. References to many programs can be found in the journals 
Educational and Psychological Measurement and Behavioral Science. 